Abstract



What are the information processing capabilities of physical systems? 

As recently as the first half of the 20th century this question did not even have a definite
meaning. What is information, and how would one process it? It took the development of 
theories of computing (in the 1930s) and information (late in the 1940s) for us to formulate 
mathematically what it means to compute or communicate. 

Yet these theories were abstract, based on axiomatic mathematics: what did physical systems 
have to do with these axioms? Rolf Landauer had the essential insight — "Information is 
physical" — that information is always encoded in the state of a physical system, whose dynamics 
on a microscopic level are well-described by quantum physics. This means that we cannot discuss 
information without discussing how it is represented, and how nature dictates it should behave. 

Wigner considered the situation from another perspective when he wrote about "the unreasonable
effectiveness of mathematics in the natural sciences". Why are the computational techniques
of mathematics so astonishingly useful in describing the physical world [?]? One might begin to
suspect foul play in the universe's operating principles.

Interesting insights into the physics of information accumulated through the 1970s and 1980s,
most sensationally in the proposal for a "quantum computer". If we were to mark a particular
year in which an explosion of interest took place in information physics, that year would have to
be 1994, when Shor showed that a problem of practical interest (factorisation of integers) could be
solved efficiently on a quantum computer. But the applications of information in physics, and vice
versa, have been far more widespread than this popular discovery. These applications include
improved experimental technology, more sophisticated measurement techniques, methods
for characterising the quantum/classical boundary, tools for quantum chaos, and deeper insight
into quantum theory and nature.

In this thesis I present a short review of ideas in quantum information theory. The first chapter 
contains introductory material, sketching the central ideas of probability and information theory. 
Quantum mechanics is presented at the level of advanced undergraduate knowledge, together with 
some useful tools for quantum mechanics of open systems. In the second chapter I outline how 
classical information is represented in quantum systems and what this means for agents trying 
to extract information from these systems. The final chapter presents a new resource: quantum 
information. This resource has some bewildering applications which have been discovered in the 
last ten years, and continually presents us with unexpected insights into quantum theory and 
the universe. 
CHAPTER 1
Prolegomenon 



Information theory and quantum mechanics form two cornerstones of an immense construction of
technology achieved in the 20th century. And apart from being highly successful in their respective
realms - and indeed in the communal realm of computing - they have recently interacted in
a way their discoverers hadn't dreamed of. The field of quantum information theory was born
when each theory stretched the envelope of its specifications: quantum mechanics by asking questions about
how small computers could be made, and how much energy is required for their operation; and
information theory by asking how the abstract notion of a logical bit is implemented in the
nuts-and-bolts world of physics.

Of course, before we begin our exploration of quantum information, we require knowledge of 
the two theories on which it is based. This first chapter provides a basis for the definitions of 
information theory and some demonstration as to why these notions are appropriate, and then 
lays the groundwork for the style of quantum mechanics required for later developments. 

1.1 Information Theory 

Information is such a rich concept that trying to pin it down with a definition amputates 
some of its usefulness. We therefore adopt a more pragmatic approach, and ask questions which 
(hopefully) have well-defined quantitative answers, such as "By what factor can given information
be compressed?" and "How much redundancy must be incorporated into this message to
ensure correct decoding when shouted over a bad telephone connection?". The answers to these
questions are given by a small number of measures of information, and the methods of answering
often yield valuable insights into the fundamentals of communication.

But first: what are the basic objects of communication, with which the theory deals? For this 
we consider a simple canonical example. Suppose we have to set up a communication link from 
the President to Defence Underground Military Bombers (DUMB) from where nuclear weapons 
are launched. When the President wakes up in the morning, he either presses a button labelled
Y (to launch) or a button labelled N (to tell DUMB to relax). This communication channel
requires two symbols which constitute an alphabet, but in general we could envisage any number
of symbols, such as the 256 byte values of 8-bit extended ASCII. As 21st century historians, we may be interested
in the series of buttons pushed by the President in 1992. We hope that he pushed N significantly
more times than he pushed Y, so it may be more economical to store in our archive the sequence
of integers $N_1, \ldots, N_k$ where $N_i$ is the number of N's between the $(i-1)$th and $i$th destructive
Y's. From this toy model we learn that our mathematical notion of information should involve an
alphabet and a probability distribution of the alphabet symbols. However it also seems desirable
that the information be invariant under changes in the alphabet or representation - we don't
want the amount of information to change simply by translating from the set $\{Y, N\}$ to the
set $\{0, 1, \ldots, 365\}$. The probability distribution seems to be a good handle onto the amount of
information, with the proviso that our measure be invariant under these "translations".

From this example we also note a more subtle point about information: it quantifies our 
ignorance. A sequence that is completely predictable, about which there is complete knowledge 
and no ignorance, contains no information. If the President's actions were entirely a function of
his childhood there would be no point in storing all the Y's and N's, or indeed for DUMB to 
pay attention to incoming signals - we could calculate them from publicly available knowledge. 
For a sequence to contain information there must be a degree of ignorance about future signals, 
so in a sense a probability is the firmest grip we can get on this ignorance. 
Hence we interrupt our development of information for a sojourn into probability theory.

1.1.1 Notions of Probability 

We will consider our alphabet to be a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ symbols. The notion of
probability we employ here, the Bayesian approach, is closely tied to the concept of information.
Informally a probability $p(a_j)$ is a measure of how much we'd be willing to bet on the outcome of
a trial (which perhaps tells us which symbol to transmit) being $a_j$ [2]. Clearly this will depend
on our subjective knowledge of how a system was prepared (or perhaps which horse has been
doped), and explains the popularity of card counting in Blackjack. We begin with an event $H$
of interest, which for the sake of definiteness we specify as "The next card dealt will be an Ace"
and prior knowledge that the deck is complete and well-shuffled with no wildcards. In this case,
our understanding of the situation tells us that

$$p(H) = 1/13. \tag{1.1}$$

Our prior knowledge in this case is implicit and is usually clear from the context. There are 
situations in which our prior knowledge might change and must be explicit, as for example when 
we know that a card has been removed from our shuffled pack. How much we'd be willing to bet 
depends on the value of that card; we then use the notation 

$$p(H \mid \text{Ace removed}) = \frac{3}{51} \quad \text{and} \quad p(H \mid \text{Other card removed}) = \frac{4}{51} \tag{1.2}$$

to demonstrate the dependence, assuming that all other prior knowledge remains the same. 
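
The bets in Eqns 1.1 and 1.2 can be checked by direct counting. The following is a minimal sketch, assuming a standard 52-card deck containing four Aces; the function name `prob_next_is_ace` is introduced here purely for illustration.

```python
from fractions import Fraction

def prob_next_is_ace(aces_left: int, cards_left: int) -> Fraction:
    """Probability that the next card is an Ace, counting equally likely cards."""
    return Fraction(aces_left, cards_left)

# Prior knowledge: a complete, well-shuffled deck (Eqn 1.1).
p_h = prob_next_is_ace(4, 52)

# Changed prior knowledge: one card is known to have been removed (Eqn 1.2).
p_h_ace_removed = prob_next_is_ace(3, 51)
p_h_other_removed = prob_next_is_ace(4, 51)

print(p_h, p_h_ace_removed, p_h_other_removed)
```

Exact fractions are used rather than floats so that the values can be compared directly with the equations.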

To make these ideas more formal, we consider a set $A$ (of signals, symbols or events), and we
define a probability measure as a function $p$ from the subsets^1 of $A$ to the real numbers satisfying
the following axioms^2:

1. $p(\emptyset) = 0$ (probability of null event)

2. $p(A) = 1$ (probability of certain event)

3. For any $M \subseteq A$, $p(M) \geq 0$

4. For distinct $a, b \in A$, $p(a) + p(b) = p(a, b)$ (probability of disjoint events is additive)^3.

This formalism gives a mathematical structure to the "plausibility of a hypothesis" in the presence 
of (unstated) prior knowledge. Happily this machinery also coincides in cases of importance 
with the frequency interpretation of probability, which allows us to employ counting arguments 
in calculating probabilities in many situations. Because of the last requirement above, we can 
specify a probability measure by giving its value on all the singleton subsets $\{a\}$ of $A$. In this
case we typically write $p(\{a\}) = p(a)$ where no confusion can arise, and even this is occasionally
shrunk to $p_a$. We also use the terms "measure" and "distribution" interchangeably - the former
simply being more abstractly mathematical in origin than the latter. 

^1 In more generality, a probability measure is defined on a $\sigma$-algebra of subsets of $A$. This allows us to extend
this description of a probability to continuous spaces.

^2 These axioms can be derived from some intuitive principles of inductive reasoning; see e.g. [?] and [?].

^3 When $p$ is defined on a $\sigma$-algebra, we demand that $p$ be additive over countable sequences of pairwise disjoint
subsets from the $\sigma$-algebra.

We will frequently be interested in a random variable, defined as a function from $A$ to the real
numbers. For example, if the sample space $A$ is a full pack of cards then the function $X$ which
takes "Spades" to 1 and other suits to 0 counts the number of Spades; we write

$$\mathrm{Prob}(X = 1) = p(1) = 1/4, \qquad \mathrm{Prob}(X = 0) = p(0) = 3/4. \tag{1.3}$$

A function of interest might then be $F = \sum_i X_i$ where each $X_i$ is one of these "Spade-counting"
random variables; in this case $F$ is defined on the space $A \times \ldots \times A$ and $X_i$ is a random variable
on the $i$th space. The expectation value of a random variable $X$ is then defined as

$$E_p X = \sum_{a \in A} p(a) X(a) \tag{1.4}$$

where the subscript makes explicit the distribution governing the random variable. This is just 
the first of a host of quantities of statistical interest, such as mean and variance, defined on 
random variables. 

If we have two sample spaces, $A$ and $B$, we can amalgamate them and consider joint probability
distributions on $A \times B$. The probability measure is then specified by the singleton probabilities
$p(a, b)$ where $a \in A$ and $b \in B$. By axiom 2 above, we have that

$$\sum_{a \in A,\, b \in B} p(a, b) = 1. \tag{1.5}$$

If for each $a \in A$ we define $p_A(a) = \sum_{b \in B} p(a, b)$ and similarly $p_B(b) = \sum_{a \in A} p(a, b)$ then $p_A$ and
$p_B$ are also probability measures, on the spaces $A$ and $B$ respectively; these measures are called
the marginal distributions. Conventionally we drop the subscripts on the marginal distributions
where confusion cannot arise. But notice that these measures are not necessarily "factors" of the
joint probability, in that $p(a, b) \neq p_A(a) p_B(b)$ in general for $a \in A, b \in B$^4. This prompts us to define
the conditional probability distributions

$$p(a|b) = \frac{p(a, b)}{p_B(b)} = \frac{p(a, b)}{\sum_{x \in A} p(x, b)}, \qquad p(b|a) = \frac{p(a, b)}{p_A(a)}. \tag{1.6}$$

These definitions lend rigour to the game of guessing Aces described by Eqn 1.2. Note in this
definition that if we choose a fixed member $b$ from the set $B$, then the distribution $p_b(a) = p(a|b)$
is also a well-defined distribution on A. In effect, learning which signal from the set B has 
occurred gives us partial knowledge of which signal from A will occur — we have updated our 
knowledge, conditional on b. 
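
As an illustration of marginal and conditional distributions, the sketch below computes both from a small joint distribution. The joint probabilities and the helper names `marginals` and `conditional_on_b` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution p(a, b) on A = {"x", "y"} and B = {0, 1}.
joint = {
    ("x", 0): Fraction(1, 2),
    ("x", 1): Fraction(1, 4),
    ("y", 0): Fraction(1, 8),
    ("y", 1): Fraction(1, 8),
}

def marginals(p):
    """Return (p_A, p_B) by summing the joint distribution over the other variable."""
    p_a, p_b = defaultdict(Fraction), defaultdict(Fraction)
    for (a, b), prob in p.items():
        p_a[a] += prob
        p_b[b] += prob
    return dict(p_a), dict(p_b)

def conditional_on_b(p, b):
    """Return the distribution p(a | b) = p(a, b) / p_B(b) for a fixed b."""
    _, p_b = marginals(p)
    return {a: prob / p_b[b] for (a, bb), prob in p.items() if bb == b}

p_a, p_b = marginals(joint)
cond = conditional_on_b(joint, 0)
print(p_a, p_b, cond)
```

For this example the joint distribution is not a product of its marginals, and the conditional distribution for fixed $b$ sums to one, as the text asserts.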

This definition can quite easily be extended to more than two sample spaces. If $A_1, \ldots, A_n$
are our spaces and $p$ is the joint probability distribution on $A = \prod_j A_j$, then, for example,

$$p(a_n, a_{n-1} | a_{n-2}, \ldots, a_1) = \frac{p(a_1, \ldots, a_n)}{p(a_1, \ldots, a_{n-2})} = \frac{p(a_1, \ldots, a_n)}{\sum_{x \in A_n,\, y \in A_{n-1}} p(a_1, \ldots, a_{n-2}, y, x)} \tag{1.7}$$

is the probability of sampling the two symbols $a_{n-1}$ and $a_n$ given the sampled sequence $a_1, \ldots, a_{n-2}$.

^4 Those events for which it is true that $p(a, b) = p_A(a) p_B(b)$ are called independent, and if this is true for all events
the distribution is called independent.

Conditional probabilities give us a handle on the "inverse probability" problem. In this
problem, we are told the outcome of a sampling from the set $A$ (or perhaps of several identically
distributed samplings from the set $A$) and asked to "retrodict" the preparation. We might perhaps
be told that one suit is missing from a pack of cards and, given three cards drawn from the smaller
pack, asked to guess which suit this is. The tool we should use is Bayes' Theorem,

$$p(b|a) = \frac{p(a|b) p(b)}{p(a)} = \frac{p(a|b) p(b)}{\sum_{y \in B} p(a|y) p(y)} \tag{1.8}$$

which is a simple consequence of $p(a|b) p(b) = p(a, b) = p(b|a) p(a)$. In applying Eqn 1.8, we
typically have knowledge of $p(b)$ (the probability that each suit was removed) or if we don't,
we apply Bayes' postulate, or the "principle of insufficient reason", which says in the absence
of such knowledge we assume a uniform distribution^5 over $B$. Knowledge of $p(a|b)$ comes from
our analysis of the situation: If all the Clubs have been removed, what is the probability of
drawing the three cards represented by $a$? Once we have calculated $p(a|b)$ from our knowledge of
the situation, and obtained the "prior" probabilities $p(b)$ from some assumption, we can use Eqn
1.8 to calculate the "posterior" probability $p(b|a)$ of each preparation $b$ based on our sampling
result $a$.
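
The missing-suit problem can be worked through numerically. The sketch below is one possible reading of the example, assuming that the observation $a$ is the ordered sequence of suits of the three drawn cards and that cards are drawn without replacement from the 39-card pack; the function names `likelihood` and `posterior` are hypothetical.

```python
from fractions import Fraction

SUITS = ("Clubs", "Diamonds", "Hearts", "Spades")

def likelihood(drawn_suits, removed_suit):
    """p(a | b): probability of drawing this sequence of suits, without
    replacement, from a pack in which all cards of removed_suit are missing."""
    counts = {s: (0 if s == removed_suit else 13) for s in SUITS}
    cards_left = 39
    prob = Fraction(1)
    for s in drawn_suits:
        prob *= Fraction(counts[s], cards_left)
        counts[s] -= 1
        cards_left -= 1
    return prob

def posterior(drawn_suits, prior):
    """Bayes' theorem (Eqn 1.8): p(b | a) = p(a | b) p(b) / sum_y p(a | y) p(y)."""
    weights = {b: likelihood(drawn_suits, b) * prior[b] for b in SUITS}
    evidence = sum(weights.values())
    return {b: w / evidence for b, w in weights.items()}

# Bayes' postulate: a uniform prior over the four possible preparations.
uniform_prior = {b: Fraction(1, 4) for b in SUITS}
post = posterior(("Hearts", "Hearts", "Spades"), uniform_prior)
print(post)
```

With a uniform prior, any suit that appears among the drawn cards is ruled out and the remaining suits are equally likely; a non-uniform prior would shift the posterior accordingly.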



Stochastic processes. We will later characterise an information source as a stochastic process
and this is an opportune place to introduce the definition. A stochastic process is defined as an
indexed sequence of random variables [?] from the same symbol set $A$, where we may imagine
the index to refer to consecutive time steps or individual characters in a sequence produced by
an information source. There may be an arbitrary dependence between the random variables, so
the sequence may be described by a distribution $\mathrm{Prob}(X_1 = x_1, \ldots, X_n = x_n) = p(x_1, \ldots, x_n)$.
A stochastic source is described as stationary if the joint distribution of any subset of symbols
is invariant with respect to translations in the index variable, i.e. for any index $s$

$$p(x_1, \ldots, x_n) = p(x_{s+1}, \ldots, x_{s+n}). \tag{1.9}$$



Example 1.1 The weather 

The assumption behind most forms of weather forecasting is that the weather operates as a 
stochastic process. Thus if we know the vector of variables like temperature, wind speed, air 
pressure and date for a series of days, we can use past experience to develop a probability 
distribution for these quantities tomorrow (except for the date, which we hope is deterministic). 

On the other hand, over a much longer time scale, the Earth's climate does not appear to be 
stochastic. It shows some sort of dynamical behaviour which is not obviously repetitive, and so 
a probability description is less appropriate. • 



1.1.2 Information Entropy 

Our aim in this section is to motivate the choice of the Shannon entropy^6,

$$H(p) = -\sum_i p(X = x_i) \log p(X = x_i) \tag{1.10}$$

as our measure of the information contained in a random variable $X$ governed by probability
distribution $p$^7. There are in fact dozens of ways of motivating this choice; we shall mention a
few.

^5 There is some ambiguity here since a uniform distribution over $x^2$ is not uniform over $x$; see [?]. This is a
source of much confusion but is not a major obstacle to retrodiction.

^6 Throughout this thesis, logarithms are assumed to be to base 2 unless explicitly indicated.

^7 $H$ is a functional of the function $p$, so the notation in Eqn 1.10 is correct. However, we frequently employ the
random variable $X$ as the argument to $H$; where confusion can arise, the distribution will be explicitly noted.

Figure 1.1: The unexpectedness of an event of probability $p$.

As a first approach to Shannon's entropy function, we may consider the random variable
$u(x) = -\log p(x)$, which is sketched in Figure 1.1. Intuitively we may justify calling $u$ the
unexpectedness of the event $x$; this event is highly unexpected if it is almost sure not to happen,
and has low unexpectedness if its probability is almost 1. The information in the random variable
$X$ is thus the "eerily self-referential" [?] expectation value of $X$'s unexpectedness.

Shannon [9] formalised the requirements for an information measure $H(p_1, \ldots, p_n)$ with the
following criteria:

1. $H$ should be continuous in the $p_i$.

2. If the $p_i$ are all equal, $p_i = 1/n$, then $H$ should be a monotonic increasing function of $n$.

3. $H$ should be objective:

$$H(p_1, \ldots, p_n) = H(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2) H\left( \frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2} \right) \tag{1.11}$$

The last requirement here means that if we lump some of the outcomes together, and consider 
the information of the lumped probability distribution plus the weighted information contained 
in the individual "lumps", we should have the same amount of information as in the original 
probability distribution. In the mathematical terminology of Aczel and Daroczy [10], the entropy
is strongly additive. Shannon proved that the unique (up to an arbitrary factor) information
measure satisfying these conditions is the entropy defined in Eqn 1.10. Several other authors,
notably Renyi, and Aczel and Daroczy, have proposed other criteria which uniquely specify
Shannon entropy as a measure of information.

The arbitrary factor may be removed by choosing an appropriate logarithmic base. The most 
convenient bases are 2 and e, the unit of information in these cases being called the bit or the 
nat respectively; if in this discussion the base of the logarithm is significant it will be mentioned. 

Some useful features to note about H are: 

• $H = 0$ if and only if some $p_i = 1$ and all others are zero; the information is zero only if we
are certain of the outcome.

• If the probabilities are equalised, i.e. any two probabilities are changed to more equal values,
then $H$ increases.

• If $p(a, b) = p_A(a) p_B(b)$ for all $a \in A, b \in B$ then

$$\begin{aligned}
H(A, B) &= -\sum_{x \in A,\, y \in B} p(x, y) \log p_A(x) p_B(y) \\
&= -\sum_{x \in A,\, y \in B} p(x, y) \log p_A(x) - \sum_{x \in A,\, y \in B} p(x, y) \log p_B(y) \\
&= -\sum_{x \in A} p_A(x) \log p_A(x) - \sum_{y \in B} p_B(y) \log p_B(y) \\
&= H(A) + H(B)
\end{aligned}$$
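
The properties listed above can be checked numerically. The following is a minimal sketch in which the probability vectors are hypothetical examples and `entropy` is an illustrative helper working in bits (base-2 logarithms).

```python
import math

def entropy(probs):
    """Shannon entropy in bits, H = -sum_i p_i log2 p_i, with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A certain outcome carries no information.
h_certain = entropy([1.0, 0.0, 0.0])

# Equalising two probabilities increases the entropy.
h_uneven = entropy([0.9, 0.1])
h_more_even = entropy([0.6, 0.4])

# Grouping property (Eqn 1.11) for a hypothetical distribution.
p = [0.5, 0.25, 0.125, 0.125]
s = p[0] + p[1]
lhs = entropy(p)
rhs = entropy([s] + p[2:]) + s * entropy([p[0] / s, p[1] / s])

# Additivity for a product distribution p(a, b) = pA(a) pB(b).
p_a, p_b = [0.5, 0.5], [0.9, 0.1]
product = [x * y for x in p_a for y in p_b]
h_product = entropy(product)

print(h_certain, h_uneven, h_more_even, lhs, rhs, h_product)
```

The comparison of `lhs` and `rhs` is only a spot check of Eqn 1.11 for one distribution, not a proof of the uniqueness result.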

We will be interested later in the information entropy of a more general distribution on
a product sample space. So consider the information contained in the distribution $p(a, b)$ on
random variables $X$ and $Y$, where $p$ is not a product distribution:

$$\begin{aligned}
H(A, B) &= -\sum_{x \in A,\, y \in B} p(x, y) \log p(x, y) \\
&= -\sum_{x \in A,\, y \in B} p(x, y) \log p(y|x) p(x) \\
&= -\sum_{x \in A,\, y \in B} p(x, y) \log p(y|x) - \sum_{x \in A} p_A(x) \log p_A(x) \\
&= H(B|A) + H(A)
\end{aligned} \tag{1.12}$$

where we have defined the conditional entropy as

$$H(B|A) = -\sum_{x \in A,\, y \in B} p(x, y) \log p(y|x) = -\sum_{x \in A} p(x) \sum_{y \in B} p(y|x) \log p(y|x). \tag{1.13}$$

Note that, for fixed $x \in A$, $p(y|x)$ is a probability distribution; so we could describe
$H(Y|x) = -\sum_{y \in B} p(y|x) \log p(y|x)$ as the $x$-based entropy. The conditional entropy is then the expectation
value of the $x$-based entropy.
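
The decomposition in Eqn 1.12 can be verified on a small example. This is a sketch under assumed values: the joint distribution is hypothetical, and the helpers `marginal` and `conditional_entropy` are illustrative names; $X$ plays the role of the first sample space and $Y$ the second.

```python
import math

# Hypothetical joint distribution p(x, y); it is not a product distribution.
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 0): 0.125, ("b", 1): 0.125}

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(p, index):
    """Marginal distribution over component `index` (0 for x, 1 for y)."""
    out = {}
    for key, prob in p.items():
        out[key[index]] = out.get(key[index], 0.0) + prob
    return out

def conditional_entropy(p):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), as in Eqn 1.13."""
    p_x = marginal(p, 0)
    return -sum(prob * math.log2(prob / p_x[x])
                for (x, _y), prob in p.items() if prob > 0)

h_xy = entropy(joint.values())
h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
h_y_given_x = conditional_entropy(joint)
print(h_xy, h_x, h_y, h_y_given_x)
```

In this example the joint entropy equals the sum of `h_y_given_x` and `h_x`, and conditioning on $X$ leaves less uncertainty about $Y$ than `h_y`.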

Using the concavity of the log function, it can be proved that $H(X, Y) \leq H(X) + H(Y)$: the
entropy of a joint event is bounded by the sum of entropies of the individual events. Equality
is achieved only if the distributions are independent, as shown above. From this inequality and
Eqn 1.12 we find

$$H(Y|X) \leq H(Y) \tag{1.14}$$

with equality only if the distributions of Y and X are independent. In the case where they are 
not independent, learning about which value of x was sampled from the set A allows us to update 
our knowledge about what will be sampled from B, so our conditioned uncertainty (entropy) is 
less than the unconditioned. We could even extend our idea of conditioned information to many 
more than just two sample spaces; if we consider the random variables $X_1, \ldots, X_n$ defined on
spaces $A_1, \ldots, A_n$, we can define

$$H(X_n | X_{n-1}, \ldots, X_1) = -\sum_{x_1 \ldots x_n} p(x_1, \ldots, x_{n-1}) \, p(x_n | x_1, \ldots, x_{n-1}) \log p(x_n | x_1, \ldots, x_{n-1})$$

to be the entropy of the next sampling once we know the preceding
$n - 1$ samples. By repeated application of the inequality above, we can show that

$$H(X_n | X_{n-1}, \ldots, X_1) \leq H(X_n | X_{n-1}, \ldots, X_2) \leq \ldots \leq H(X_n | X_{n-1}) \leq H(X_n). \tag{1.15}$$

On average, conditioning cannot increase our entropy and uncertainty.

Figure 1.2: The entropy of a binary random variable.

We will now employ the characterisation of an information source as a stochastic process 
as mentioned earlier. Consider a stationary stochastic source producing a sequence of random 
variables $X_1, X_2, \ldots$ and the "next-symbol" entropy $H(X_{n+1} | X_n, \ldots, X_1)$. Note that

$$\begin{aligned}
H(X_{n+1} | X_n, \ldots, X_1) &\leq H(X_{n+1} | X_n, \ldots, X_2) \\
&= H(X_n | X_{n-1}, \ldots, X_1)
\end{aligned}$$

where the equality follows from the stationarity of the process. Thus next-symbol entropy is a
non-increasing sequence of non-negative quantities and so has a limit. We call this limit the entropy
rate of the stochastic process, $H(X)$. For a stationary stochastic process this is equal to the
limit of the average entropy per symbol,

$$\lim_{n \to \infty} \frac{H(X_1, \ldots, X_n)}{n}, \tag{1.16}$$

which is a further justification for calling this limit the unique entropy rate of the stochastic
process.

Example 1.2 Entropy rate of a binary channel 

Consider a stochastic source producing a random variable from the set $\{0, 1\}$, with $p(0) = p$,
$p(1) = 1 - p$. Then the entropy is a function of the real number $p$, given by
$H(p) = -p \log p - (1 - p) \log(1 - p)$. This function is plotted in Figure 1.2. Notice that the function
is concave and achieves its maximum value of unity when $p = 1/2$. •
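
The binary entropy function of Example 1.2 is easy to tabulate. The sketch below evaluates it on a coarse grid (the grid spacing of 0.01 is an arbitrary choice) and locates the maximum; `binary_entropy` is an illustrative name.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) = -p log2 p - (1 - p) log2 (1 - p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

grid = [i / 100 for i in range(101)]
values = [binary_entropy(p) for p in grid]
p_max = grid[values.index(max(values))]
print(p_max, max(values))
```

The function is symmetric about $p = 1/2$, so a biased source with $p(0) = 0.3$ has the same entropy as one with $p(0) = 0.7$.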

It was mentioned previously that entropy is a measure of our ignorance, and since we interpret
probabilities as subjective belief in a proposition this "ignorance" must also be subjective. The
following example, due to Uffink and quoted in [?], illustrates this. Suppose my key is in my
pocket with probability 0.9 and if it is not there it could equally likely be in a hundred other
places, each with probability 0.001. The entropy (using natural logarithms in this example, so measured in nats) is then
$-0.9 \ln 0.9 - 100(0.001 \ln 0.001) = 0.7856$. If I put my hand in my pocket and the key is not there, the entropy jumps to
$-\ln 0.01 = 4.605$: I am now extremely uncertain where it is! However, after checking my pocket there will
on average be less uncertainty; in fact the weighted average is $-0.9 \ln 1 - 0.1 \ln 0.01 = 0.4605$.
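
The numbers in the key example can be reproduced directly. The sketch below uses natural logarithms, since that is the base for which the quoted values come out; `entropy_nats` is an illustrative helper.

```python
import math

def entropy_nats(probs):
    """Shannon entropy using natural logarithms (units of nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Before looking: pocket with probability 0.9, or one of 100 other places.
before = entropy_nats([0.9] + [0.001] * 100)

# After looking and not finding the key: 100 equally likely places.
not_in_pocket = entropy_nats([0.01] * 100)

# After looking and finding the key: no uncertainty remains.
in_pocket = entropy_nats([1.0])

# Expected entropy after checking the pocket.
average_after = 0.9 * in_pocket + 0.1 * not_in_pocket
print(round(before, 4), round(not_in_pocket, 3), round(average_after, 4))
```

A single unlucky observation raises the entropy, but the average over both outcomes is lower than the entropy before looking.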

1.1.3 Data Compression: Shannon's Noiseless Coding Theorem 

What is the use of the "entropy rate" of a stochastic source? If our approach to information 
is to be pragmatic, as mentioned previously, then we need to find a useful interpretation of this 
quantity. 

We will first consider a stochastic source which produces independent, identically-distributed
random variables $X_1, X_2, \ldots, X_n$. A typical sequence drawn from this source might be $x_1 x_2 \cdots x_n$
with probability $p(x_1, \ldots, x_n) = p(x_1) \cdots p(x_n)$. Taking logarithms of both sides and dividing
by $n$, we find

$$-\frac{1}{n} \log p(x_1, \ldots, x_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i) \tag{1.17}$$

and by the law of large numbers^8 the quantity on the right approaches the expectation value of
the random variable $-\log p(X)$, which is just the entropy $H(X)$. More precisely, if we consider
the set

$$A_\epsilon^{(n)} = \left\{ (x_1, \ldots, x_n) \in A^n \;\middle|\; 2^{-n(H(X) + \epsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X) - \epsilon)} \right\} \tag{1.18}$$

then the following statements are consequences of the law of large numbers:

1. $\mathrm{Prob}(A_\epsilon^{(n)}) > 1 - \epsilon$ for $n$ sufficiently large.

2. $|A_\epsilon^{(n)}| > (1 - \epsilon) 2^{n(H(X) - \epsilon)}$ for $n$ sufficiently large, where $|\cdot|$ denotes the number of elements
in the set.

3. $|A_\epsilon^{(n)}| \leq 2^{n(H(X) + \epsilon)}$.

Thus we can find a set of roughly $2^{nH(X)}$ strings (out of a possible $|A|^n$, where $A$ is the set
of possible symbols) which all have approximately the same probability, and the probability of
producing a string not in the set is arbitrarily small.
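
The three statements about the typical set can be explored numerically for a binary source. The sketch below is illustrative only: the source bias 0.2, the tolerance 0.105 and the block lengths are arbitrary assumptions, and `typical_set_stats` is a hypothetical helper. It exploits the fact that for an independent binary source the probability of a string depends only on its number of ones.

```python
import math

def typical_set_stats(p_one: float, n: int, eps: float):
    """For an i.i.d. binary source with Prob(1) = p_one, return
    (entropy H, total probability of the typical set, log2 of its size)."""
    q = 1.0 - p_one
    h = -p_one * math.log2(p_one) - q * math.log2(q)
    prob, size = 0.0, 0
    for k in range(n + 1):  # k = number of ones in the string
        log2_p_string = k * math.log2(p_one) + (n - k) * math.log2(q)
        # Membership test of Eqn 1.18, written in terms of log2 p(x_1, ..., x_n).
        if -n * (h + eps) <= log2_p_string <= -n * (h - eps):
            count = math.comb(n, k)  # number of strings with exactly k ones
            size += count
            prob += count * 2.0 ** log2_p_string
    return h, prob, (math.log2(size) if size else float("-inf"))

for n in (100, 1000):
    h, prob, log2_size = typical_set_stats(0.2, n, 0.105)
    print(n, round(h, 4), round(prob, 4), round(log2_size / n, 4))
```

The upper bound on the size holds for every $n$, whereas the probability bound only takes hold once $n$ is large enough, which is why two block lengths are compared.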

The practical importance of this is evident once we associate, with each string in this "most-likely"
set, a unique string from the set $\{0, 1\}^{nH(X)}$. We thus have a code for representing
strings produced by the stochastic source $\{X_i\}$ which makes arbitrarily small errors^9, and which
uses $nH(X)$ binary digits to represent $n$ source symbols. We thus have the extremely useful
interpretation of $H(X)$ as the expected number of bits per source symbol required to represent long strings from
the stochastic source producing random variables $\{X_i\}$, in the case where each random variable
in the set is independently distributed.

^8 The law of large numbers states that if $x_1, \ldots, x_N$ are independent identically-distributed random variables
with mean $\bar{x}$ and finite variance, then $P(|\frac{1}{N} \sum_i x_i - \bar{x}| > \delta) < \epsilon$ for any $\delta, \epsilon > 0$ and sufficiently large $N$.

^9 For the rare occasions when an unlikely string $x_1 \ldots x_n$ occurs, it does not matter how we deal with it: if our
system can tolerate errors we can associate it with the string $0 \ldots 0$; if not, we can code it into any unique string
of length $\ll 1/p(x_1, \ldots, x_n)$.

We have not quite done all we claimed in the previous paragraph, since we have ignored the
possibility of a more compact coding strategy. Let the strings from $A^n$ be arranged in order of
decreasing probability, and let $N(q)$ be the number of strings, taken in this order, required to
form a set of total probability $q$. Then Shannon proved the remarkable result that for $q \neq 0, 1$

$$\lim_{n \to \infty} \frac{\log N(q)}{n} = H. \tag{1.19}$$

For large $n$ it makes no difference how we define "probable": all probable sets contain about $2^{nH}$
elements!

Note also that if our strings are produced by an independent binary source with $p(0) \neq 1/2$
then $H(p) < 1$ (see Example 1.2). Thus our code strings are shorter than the source strings: we have
compressed the information. The resulting code strings will have $p(0) \approx p(1) \approx 1/2$ and entropy
rate close to unity.

We will now look at what changes when a fully stochastic source is considered. Instead
of the random variables $X_1, \ldots, X_n$ being independent, they are described by a probability
distribution $p(x_1, \ldots, x_n)$. From the considerations above, we can design a code with string
lengths $l(x_1 \ldots x_n)$ such that the expected length per symbol $L = \frac{1}{n} \sum p(x_1, \ldots, x_n) l(x_1 \ldots x_n)$
satisfies

$$\left| \frac{H(X_1, \ldots, X_n)}{n} - L \right| < \epsilon. \tag{1.20}$$

If our stochastic source is stationary, then $H(X_1, \ldots, X_n)/n$ approaches the entropy rate $H(X)$.
Thus in this case too, the entropy rate describes the shortest code available for a particular source.
A provably optimal (that is, shortest) code can be found using an algorithm discovered by
Huffman [?], and the communication theorist now has an enormous variety of codes to choose
from to suit his application.
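
To make the idea of an optimal prefix code concrete, the following sketch computes Huffman codeword lengths for a small alphabet. It is a generic textbook-style construction coding one symbol at a time, not a reproduction of any particular reference; the probabilities and the name `huffman_code_lengths` are assumptions made for illustration.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return Huffman codeword lengths (in bits) for symbols with probabilities probs."""
    # Heap entries: (probability, unique tie-breaker, indices of symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)  # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                # each merge adds one bit to these codewords
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_code_lengths(probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
h = -sum(p * math.log2(p) for p in probs)
print(lengths, avg_len, h)
```

The expected codeword length lies between the entropy and the entropy plus one bit, in line with the bounds of the coding theorem for blocks of a single symbol.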

For reference, Shannon's first theorem in full generality is given below.

Theorem 1.1 (Shannon I)

Let $C : A^n \to B$ be a code with binary codewords, and suppose $C$ has the property that no code
word is the prefix of another code word. For each $x \in A^n$ we denote the length of the codeword
$C(x)$ by $l_C(x)$ and define $L_C = \frac{1}{n} \sum_x p(x) l_C(x)$, the expected code word length per source symbol.
Suppose the source is stochastic. Then there exists a code $C$ such that

$$\frac{H(X_1, \ldots, X_n)}{n} \leq L_C \leq \frac{H(X_1, \ldots, X_n)}{n} + \frac{1}{n}. \tag{1.21}$$

An arbitrarily small error rate can be achieved if and only if the first inequality is satisfied.



We can also use Shannon's theorem to interpret the conditional entropy 

$$H(X|Y) = H(X, Y) - H(Y). \tag{1.22}$$

The right-hand side of this equation may be viewed as the number of bits required to code X 
and Y together, less the number of bits required to specify Y alone. The difference must surely 
be the average number of bits required to code X once Y is known - one could perhaps envisage 
using a different code for each random variable Y already in our possession. This interpretation 
will be important for considerations in the next section. 
1.1.4 Information Channels and Mutual Information

Suppose we have a microphone on a stage, and the President is speaking into it. The microphone
will be hooked up to a loudspeaker system so that the assembled throng will be able
to hear him. The President in this situation is an information source, stochastically producing
words (or more generally sounds) near the microphone. Considered entirely separately, the loudspeaker
is also a stochastic information source. If the technical crew have done their job there
should not only be a correlation between the President's words and the output of the loudspeaker, there should be a one-to-one
correspondence. Another way of saying this is that if we know the sounds produced by the
President, we should be absolutely certain of what sounds will be produced by the loudspeaker.
A mathematical way of expressing this is

$$H(\text{loudspeaker} \mid \text{President}) = 0: \tag{1.23}$$

the uncertainty (entropy) once we know the President's words should be zero. Of course a real 
microphone-amplifier-loudspeaker (MAL) system does introduce some errors, so in general the 
conditional entropy above may be some small positive amount. A measure of the fidelity of the 
MAL system could then be the reduction in entropy once we know the original information: 

$$I(\text{loudspeaker} : \text{President}) = H(\text{loudspeaker}) - H(\text{loudspeaker} \mid \text{President}). \tag{1.24}$$

If somebody accidentally unplugged the microphone from the amplifier, then there would be
no correlation between the President's speech and the sound produced by the loudspeaker and
these could be considered to be independent sources, in which case
$H(\text{loudspeaker} \mid \text{President}) = H(\text{loudspeaker})$ and

$$I(\text{loudspeaker} : \text{President}) = 0. \tag{1.25}$$

The quantity defined in Eqn 1.24 is called the mutual information between the President and
the loudspeaker. In a general setting we would have two stochastic sources $X$ and $Y$ with a given
joint distribution $p(x, y)$ (which could in fact be a distribution over $n$-tuples of symbols from $X$
and $Y$). The mutual information would then be

\begin{align}
I(X : Y) &= H(X) - H(X|Y) \tag{1.26} \\
&= H(X) - H(X, Y) + H(Y) \nonumber \\
&= H(Y) - H(Y|X) \tag{1.27}
\end{align}

where we have used Eqn 1.12 to obtain the second equality. In this context, $H(X|Y)$ is sometimes
referred to as the equivocation of the channel.

Before continuing to the application and interpretation of the mutual information, we note
a few mathematical features of this function. We note first the pleasing symmetry $I(X : Y) = I(Y : X)$;
the amount of information we gain about $X$ when we learn $Y$ is the same as the amount
of information gained about $Y$ on learning $X$. Notice that, because $0 \leq H(X|Y) \leq H(X)$, the
mutual information is always non-negative and never exceeds either of the entropies $H(X)$ and $H(Y)$;
the mutual information is zero if and only if the distributions on $X$ and $Y$ are independent.
Finally observe that $H(X|X) = 0$, whence $I(X : X) = H(X)$: the self-information of a source
is equal to its information entropy.
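
These properties of the mutual information can be illustrated with three small joint distributions. The sketch below is hypothetical throughout: the channel with a 10% flip probability stands in for an imperfect microphone-amplifier-loudspeaker system, and the helper names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, index):
    """Marginal distribution over component `index` of the (x, y) keys."""
    out = {}
    for key, prob in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + prob
    return out

def mutual_information(joint):
    """I(X:Y) = H(X) + H(Y) - H(X,Y), the middle form between Eqns 1.26 and 1.27."""
    h_x = entropy(marginal(joint, 0).values())
    h_y = entropy(marginal(joint, 1).values())
    return h_x + h_y - entropy(joint.values())

# Noisy channel: Y copies a fair bit X but flips it 10% of the time.
noisy = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
# Unplugged microphone: X and Y are independent.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Perfect copy: Y = X, so the mutual information equals H(X).
perfect = {(0, 0): 0.5, (1, 1): 0.5}

for name, dist in (("noisy", noisy), ("independent", independent), ("perfect", perfect)):
    print(name, round(mutual_information(dist), 4))
```

Swapping the roles of $X$ and $Y$ in any of these distributions leaves the result unchanged, reflecting the symmetry noted above.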

The mutual information is in fact a special case of another function, the relative information
(otherwise known as the Kullback-Leibler distance)^10 between two distributions, which we mention
here for completeness. The relative information between two probability distributions $p(x)$ and
$q(x)$ is defined to be

$$D(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(x)}{q(x)}. \tag{1.28}$$

^10 The Kullback-Leibler "distance" is not in fact a metric: it is clearly not symmetric and doesn't satisfy a
triangle inequality.

Note that the relative information is not symmetric in p and q. In fact D has several very useful 
interpretations, notably as the expected number of bits over and above the H(p) required by 
Shannon's theorem if the code you are using is optimised for the non-occurring distribution q. It 
is also easy to see that the probability of observing a string x_1 x_2 ... x_n from a source producing 
independent, identically-distributed symbols with distribution p(x_i) is related to the distance 
between the observed distribution and the real distribution. If we let q(a_j) = n_j/n be the empirical 
distribution over the alphabet {a_1, ..., a_N}, where n_j is the number of times a_j occurs in the string, then 

p(x_1 x_2 \ldots x_n) = \prod_{i=1}^{n} p(x_i) = \prod_{j=1}^{N} p(a_j)^{n q(a_j)} 
= \prod_{j=1}^{N} \exp[ n q(a_j) \log p(a_j) ] 
= \exp\left( n \sum_{j=1}^{N} [ q(a_j) \log p(a_j) - q(a_j) \log q(a_j) + q(a_j) \log q(a_j) ] \right) 
= \exp( -n [ H(q) + D(q\|p) ] ). 
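As a sanity check on the identity just derived, the sketch below compares the direct product of symbol probabilities with exp(-n[H(q) + D(q||p)]). The source distribution `p`, the observed `string` and all function names are hypothetical choices made for illustration; natural logarithms are used so that the exponential form applies directly.

```python
from collections import Counter
from math import exp, log, prod

def entropy_nats(q):
    """H(q) in nats for a distribution stored as a dict symbol -> probability."""
    return -sum(v * log(v) for v in q.values() if v > 0)

def relative_information_nats(q, p):
    """D(q||p) in nats; assumes p[a] > 0 wherever q[a] > 0."""
    return sum(v * log(v / p[a]) for a, v in q.items() if v > 0)

# Hypothetical source distribution and observed string (illustrative only).
p = {"a": 0.5, "b": 0.3, "c": 0.2}
string = "aabacbbaaa"

n = len(string)
counts = Counter(string)
q = {a: counts[a] / n for a in p}  # empirical distribution q(a_j) = n_j / n

direct = prod(p[x] for x in string)  # product of the p(x_i)
via_divergence = exp(-n * (entropy_nats(q) + relative_information_nats(q, p)))

print(direct, via_divergence)
```

The two numbers agree up to floating-point rounding, whatever string is chosen, because H(q) + D(q||p) reduces to the sum of -q(a_j) log p(a_j).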

The mutual information is seen to be the relative information between the true joint distribution 
of X and Y and the product of their marginal distributions: 

I(X : Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)} = D( p(x,y) \| p(x) p(y) ) 

and so is a measure of the correlation between X and Y, that is, the extent to which they differ 
from being independent. 
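The equality between the entropy form of the mutual information (Eqn 1.26) and the relative-information form can be confirmed on a small example. This is an illustrative sketch only: the joint distribution and the helper names are assumptions, and logarithms are taken to base 2.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

def relative_information(p, q):
    """D(p||q) in bits for two equal-length lists; assumes q[i] > 0 where p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint distribution joint[x][y] = p(x, y).
joint = [[0.3, 0.2], [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

flat_joint = [joint[x][y] for x in range(2) for y in range(2)]
flat_product = [p_x[x] * p_y[y] for x in range(2) for y in range(2)]

mi_from_entropies = entropy(p_x) + entropy(p_y) - entropy(flat_joint)
mi_from_divergence = relative_information(flat_joint, flat_product)
reversed_divergence = relative_information(flat_product, flat_joint)

print(mi_from_entropies, mi_from_divergence, reversed_divergence)
```

The first two values coincide, while the third (with the arguments of D exchanged) differs, which illustrates the asymmetry of the relative information.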

As discussed in the previous section, the conditional entropy H(X|Y) is a measure of the 
expected number of bits required to code X once we know the value of Y; the mutual information 
I(X : Y) = H(X) - H(X|Y) thus quantifies the information about X conveyed by Y and vice versa, 
which gives it a more rigorous interpretation than the heuristic explanation above. It also leads us 
directly to the idea of an information channel. 

We characterise an information channel by the mistakes it makes, and this knowledge is 
generally derived from an understanding of the physical system used to convey the information. 
For example in the MAL system described above, the response of the amplifier will be frequency 
dependent and perhaps have an upper and lower cutoff; the loudspeaker in turn will also have a 
characteristic response, all of which will lead to hissing, random noise, feedback and other such 
unwanted effects. If we assume that the alphabet under consideration is a sequence of phonemes 
spoken by the President (we could alternatively analyse the spectrum of his voice), then what 
his PR people will be interested in is the probability that a given phoneme is produced by the 
loudspeaker when the President utters another given phoneme. The intermediate steps don't 
interest them; what they care about is a statement such as p(lspk = "ch" | Pres = "sh") = 0.1, 
because this substitution could be damaging. 

More rigorously, an information channel is characterised by an input alphabet X, an output 
alphabet Y and the transition probabilities p(y|x) that describe the probability of the input 
symbol x being turned into output symbol y. For a given probability distribution over the input 
symbols X, we can calculate the mutual information (per symbol) between the input and output. 
If we further assume that the channel can accept r input symbols per unit time, then we 
begin to see an interesting problem before us: What is the fastest rate at which information can 
be conveyed across this channel? And can we transmit information with arbitrarily few errors 
despite the introduction of probabilistic errors by the channel? 

These questions will take us to the heart of classical information theory. But to jump the gun 
a bit: The answer to the second question is Yes, and transmission with arbitrarily few errors can take place 
at the rate 

C = \max_{p(x)} I(X : Y) (1.30) 

bits per symbol. This is surprising, since one would imagine that we could either transmit rapidly 
or transmit faithfully, but not both at the same time. That both are possible is the content of Shannon's 
Second Theorem, the Noisy Coding Theorem, which will be the subject of the next section. 
The quantity defined in Eqn 1.30 is called the capacity of the channel. In most cases of 
interest this cannot be calculated explicitly, but some examples serve to illustrate the idea of 
capacity. 



Example 1.3 Noiseless and useless channels 

If we have a noiseless binary channel, so that p(0|0) = 1 and p(1|1) = 1, then the maximum possible 
output entropy is 1 bit per symbol and this capacity is achieved if we simply ensure that the 
source probabilities are p(0) = p(1) = 1/2. On the other hand, if all the transition probabilities 
are equal to 1/2 then we can never hope to transmit any information. • 

Example 1.4 Binary symmetric channel 

Suppose a channel transmitting binary signals has probability p of flipping each bit, independently 
of other bits or of the particular value of this bit. By the symmetry of the errors, we 
observe that to maximise the mutual information we should set p(0) = p(1) = 1/2, so that 
H(transmitted) = 1. If the channel output is a 1, then Bob knows p(1|0) = p and p(1|1) = 1 - p, 
so his remaining uncertainty about the transmitted bit is H(p); the same holds when the output 
is a 0. He therefore calculates 

C = 1 - H(p) (1.31) 

where H is the binary entropy function plotted in Figure 1.2. • 
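The claim in Example 1.4 that the uniform input distribution maximises the mutual information can be checked by brute force. The sketch below is an illustration under stated assumptions: the flip probability `flip = 0.1`, the grid resolution and the function names are not taken from the text.

```python
from math import log2

def h2(p):
    """Binary entropy H(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_mutual_information(p0, flip):
    """I(X:Y) = H(Y) - H(Y|X) for a binary symmetric channel with P(X = 0) = p0."""
    y0 = p0 * (1 - flip) + (1 - p0) * flip  # P(Y = 0)
    return h2(y0) - h2(flip)

flip = 0.1  # hypothetical bit-flip probability
grid = [i / 1000 for i in range(1001)]  # candidate values of P(X = 0)
best_p0 = max(grid, key=lambda p0: bsc_mutual_information(p0, flip))
capacity = 1 - h2(flip)  # Eqn 1.31

print("maximising input probability:", best_p0)
print("grid maximum:", bsc_mutual_information(best_p0, flip), " 1 - H(p):", capacity)
```

Setting `flip` to 0 or to 1/2 reproduces the noiseless and useless channels of Example 1.3.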



Example 1.5 Ternary channel 

Suppose we have a channel with three symbols as depicted in Figure 1.3, where one symbol 0 is 
transmitted without error, and the other two symbols 1 and 2 are interchanged with probability 
p. By symmetry, the capacity-achieving input source should have p(0) = P, p(1) = p(2) = Q. 
The mutual information will then be 

I = -P \log P - 2Q \log Q - 2Q\alpha (1.32) 

where \alpha = H(p) = -p \log p - (1-p) \log(1-p) is the noise due to the channel. We incorporate the 
constraint P + 2Q = 1 with a Lagrange multiplier \lambda; we must maximise U = -P \log P - 2Q \log Q - 
2Q\alpha + \lambda(P + 2Q), whence (taking logarithms to base e) 

\frac{\partial U}{\partial P} = -1 - \log P + \lambda = 0 
\frac{\partial U}{\partial Q} = -2 - 2 \log Q - 2\alpha + 2\lambda = 0. 

Eliminating \lambda we find \log P = \log Q + \alpha, or P = Q e^{\alpha}. Together with P + 2Q = 1, and 
substituting back into Eqn 1.32, this gives 

P = \frac{e^{\alpha}}{e^{\alpha} + 2}, \quad Q = \frac{1}{e^{\alpha} + 2}, \quad C = \log \frac{e^{\alpha} + 2}{e^{\alpha}}. (1.33) 

Figure 1.3: The action of an uncertain ternary channel. 

Footnote (on the channel capacity, Eqn 1.30): The mutual information is continuous over the probability simplex, and this simplex is compact, so we are 
justified in calling this the maximum in place of supremum. This also implies that the maximum is attained. 
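The Lagrange-multiplier solution of Example 1.5 can be compared against a direct numerical maximisation of Eqn 1.32. This is a sketch under stated assumptions: the interchange probability `p = 0.2`, the grid resolution and the function names are illustrative, and natural logarithms are used throughout, as the relation P = Q e^α requires.

```python
from math import exp, log

def noise(p):
    """alpha = H(p) in nats, the noise introduced on symbols 1 and 2."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log(p) - (1 - p) * log(1 - p)

def ternary_information(Q, alpha):
    """I = -P log P - 2Q log Q - 2Q alpha with P = 1 - 2Q (Eqn 1.32, in nats)."""
    P = 1 - 2 * Q
    total = -2 * Q * alpha
    if P > 0:
        total -= P * log(P)
    if Q > 0:
        total -= 2 * Q * log(Q)
    return total

p = 0.2  # hypothetical probability that symbols 1 and 2 are interchanged
alpha = noise(p)

# Closed-form optimum obtained from the Lagrange conditions.
Q_star = 1 / (exp(alpha) + 2)
P_star = exp(alpha) * Q_star
capacity = log((exp(alpha) + 2) / exp(alpha))

# Brute-force maximisation over Q in [0, 1/2] for comparison.
grid = [i / 20000 for i in range(10001)]
Q_grid = max(grid, key=lambda Q: ternary_information(Q, alpha))

print("closed form: P =", P_star, " Q =", Q_star, " C =", capacity, "nats")
print("grid search: Q =", Q_grid, " I =", ternary_information(Q_grid, alpha))
```

With `p = 0` the noise α vanishes and the sketch returns P = Q = 1/3 and C = log 3, the capacity of a noiseless ternary channel.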



The channels we have considered here are described as memoryless. For the brave-hearted 
and strong-willed out there, one can also consider channels with memory, i.e. where the error 
process can be considered as a stochastic process depending on arbitrarily many previous input 
and output symbols. Memoryless channels are, fortunately, the rule in situations of interest; and 
most examples of stochastic noise can be approximated by memoryless channels transmitting 
large symbol blocks. 

1.1.5 Channel Capacity: Shannon's Noisy Coding Theorem 

The fundamental idea Shannon employed in showing that information can be transmitted 
reliably over a noisy channel was to allow a small probability of error, which goes to zero in some 
limit — in particular, in the limit when we code large blocks of symbols. 

Figure 1.4 is a sketch of the communication system considered here. The stochastic source 
produces symbols (represented by the random variable X) drawn from the alphabet A, and a 
source encoder optimally codes strings of source symbols into strings of channel symbols drawn 
from the set B; this "combined source" is represented by X_s. The channel encoder introduces 
some redundancy into the message by mapping m channel symbols into r channel symbols, with 
r > m; this gives an effective source X_c. The output Y_c of the channel is not necessarily a string 
in the range of the code \mathcal{C}, and we use a decoding function f to correct these changes. Finally 
a source decoder is applied; an error occurs whenever the random variable X ≠ Y. With an 
optimal source encoder, we have H(X_s) = log |B| (no redundancy), and H(X) is a fixed multiple of log |B| 
set by the number of channel symbols used per source symbol, 
where |B| denotes the number of elements in B. The input to the channel then has entropy 
H(X_c) = (m/r) H(X_s). The rate of the channel is defined to be R = H(X_c), that is, the number of 
useful bits conveyed per symbol sent. The channel capacity C in this case is defined to be the 
mutual information of the channel symbols, X_s and Y_c, maximised over all codes, 

C = \max \{ I(X_s : Y_c) \mid \mathcal{C} : B^m \to B^r \}, (1.34) 

since the code will imply a probability distribution over the output symbols. 

Figure 1.4: Sketch of a communication system. 

Footnote (to the Lagrange multiplier calculation in Example 1.5): Here we assume the logarithm is to the base e. 

We are now ready to state Shannon's Noisy Channel Coding Theorem. 

Theorem 1.2 (Shannon II) 

If R < C, there exists a code \mathcal{C} : B^m \to B^r such that the probability of a decoding error for any 
message (string from B^m) is arbitrarily small. Conversely, if the probability of error is arbitrarily 
small for a given code, then R ≤ C. 

Before we sketch a proof of this, we return to the idea of conditional entropy between the 
sent and received messages, H(X_c|Y_c). Suppose we have received an output y_1 ... y_n; then if we 
had a noiseless side channel which could convey the "missing" information nH(X_c|Y_c), we could 
perfectly reconstruct the sent message. This means that there were on average 2^{nH(X_c|Y_c)} errors 
which this side channel would allow us to correct, or equivalently that the received string, on 
average, could have originated from one of 2^{nH(X_c|Y_c)} possible input strings. And this last fact 
is crucial to coding, since the optimal code will produce codewords which aren't likely to diffuse 
to the same output string. 

We begin by looking at the first statement in the theorem above. The strategy used by 
Shannon was to consider random codes (i.e. random one-to-one functions from the messages B^m 
to the code words B^r) and average the probability of error over all these codes. The decoding 
technique will be to determine, for the received string y_1 ... y_r, the set of 2^{r(H(X_c|Y_c)+\delta)} most 
likely inputs, which we will call the decoding sphere D_\delta. We then decode this received string by 
associating it with a code word in its decoding sphere. Suppose without loss of generality that 
the input code word was x^{(1)}; then there are two cases in which an error could occur: 

1. x^{(1)} may not be in the decoding sphere; 

2. There may be other code words apart from x^{(1)} in the decoding sphere. 

Footnote (to Theorem 1.2): The proof outline given here was inspired by John Preskill's proof in Lecture Notes for Physics 229: Quantum 
Information and Computation, available at http://www.theory.caltech.edu/~preskill/ph229. 

Given arbitrary \epsilon > 0 and \delta > 0, the probability that x^{(1)} \in D_\delta can be made greater than 1 - \epsilon 
by choosing r large. So the probability of an error is 

P_e \le \epsilon + p( x^{(i)} \in D_\delta for at least one i = 2, 3, \ldots, 2^{rR} ). (1.35) 

There are 2^{rR} code words distributed among |B|^r = 2^{rH(X_s)} strings from B^r; by the assumption 
of random coding, the probability that an arbitrary string is a code word is 

p(X is a codeword) = \frac{2^{rR}}{2^{rH(X_s)}} = 2^{-r(H(X_s) - R)} (1.36) 

independently of other code word assignments. We can now calculate the probability that D_\delta 
contains a code word (apart from x^{(1)}): 

p(code word in D_\delta) = \sum_{X \in D_\delta, X \ne x^{(1)}} p(X is a codeword) 
= (|D_\delta| - 1) 2^{-r(H(X_s) - R)} \le 2^{-r(H(X_s) - R - H(X_c|Y_c) - \delta)}, (1.37) 

where the last step uses |D_\delta| = 2^{r(H(X_c|Y_c)+\delta)}. 
Now H(X_c|Y_c) = H(X_s,X_c|Y_c) = H(X_c|X_s,Y_c) + H(X_s|Y_c) = H(X_s|Y_c), where the first equality 
follows from the functional dependence of messages X_s on code words X_c, the second follows 
from Eqn 1.12, and the third follows from the functional dependence of X_c on X_s. Thus we can 
simplify the expression above: 

p(code word in D_\delta) \le 2^{-r(I(X_s : Y_c) - R - \delta)} (1.38) 

and we conclude that the probability of an error goes to zero exponentially (for large r) as long 
as I(X_s : Y_c) - \delta > R. If we in fact employ the code that achieves channel capacity and choose 
\delta to be arbitrarily small, then the condition for vanishing probability of error becomes 

C > R (1.39) 

as desired. 

Note that we have shown that the average probability of error can be made arbitrarily small: 

\frac{1}{2^{rR}} \sum_i p(error when code word x^{(i)} is sent) < \epsilon. (1.40) 

Let the number of code words for which the probability of error is greater than 2\epsilon be denoted 
N_{2\epsilon}; then 

\frac{1}{2^{rR}} N_{2\epsilon} (2\epsilon) < \epsilon (1.41) 

so that N_{2\epsilon} < 2^{rR-1}. If we throw away these code words and their messages then all code 
words have probability of error less than 2\epsilon. There will now be at least 2^{rR-1} messages communicated, 
and the effective rate will be 

Rate = \frac{\log(\text{number of messages})}{\text{number of symbols sent}} = \frac{rR - 1}{r} \to R (1.42) 

for large r. 

For the converse, we begin by noting that the channel transition probability for a string of r 
symbols factorises (by the memoryless channel assumption): p(y_1 ... y_r | x_1 ... x_r) = p(y_1|x_1) ··· p(y_r|x_r). 
It is then easy to show that 

H(Y_c^r | X_c^r) = \sum_{i=1}^{r} H(Y_{c,i} | X_{c,i}). (1.43) 

Also, since H(X,Y) ≤ H(X) + H(Y), we have H(Y_c^r) ≤ \sum_i H(Y_{c,i}), so that 

I(Y_c^r : X_c^r) = H(Y_c^r) - H(Y_c^r | X_c^r) 
\le \sum_{i=1}^{r} [ H(Y_{c,i}) - H(Y_{c,i} | X_{c,i}) ] 
= \sum_{i=1}^{r} I(Y_{c,i} : X_{c,i}) \le rC. 

But mutual information is symmetric, I(X : Y) = I(Y : X), and using the fact that H(X_c^r) = rR 
we find 

I(X_c^r : Y_c^r) = H(X_c^r) - H(X_c^r | Y_c^r) = rR - H(X_c^r | Y_c^r) \le rC. (1.44) 

The quantity (1/r) H(X_c^r | Y_c^r) measures our average uncertainty about the input after receiving the 
channel output. If our error probability goes to zero as r increases, then this quantity must 
become arbitrarily small, whence R ≤ C. 
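To see why the theorem is surprising, it helps to look at the naive alternative. The sketch below is not part of Shannon's construction: it computes the exact error probability of an n-fold repetition code with majority-vote decoding on a binary symmetric channel, with a hypothetical flip probability of 0.1. The error falls as n grows, but only because the rate 1/n falls to zero, whereas the theorem guarantees reliable transmission at any fixed rate below C = 1 - H(p).

```python
from math import comb, log2

def repetition_error(n, flip):
    """Probability that a majority vote over n (odd) repeated bits is wrong,
    when each bit is flipped independently with probability `flip`."""
    return sum(comb(n, k) * flip**k * (1 - flip) ** (n - k)
               for k in range((n + 1) // 2, n + 1))

flip = 0.1  # hypothetical bit-flip probability
capacity = 1 + flip * log2(flip) + (1 - flip) * log2(1 - flip)  # 1 - H(p)

# (block length, rate in bits per channel use, decoding error probability)
table = [(n, 1 / n, repetition_error(n, flip)) for n in (1, 3, 5, 7, 9)]

print("capacity:", capacity, "bits per symbol")
for n, rate, error in table:
    print(n, rate, error)
```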

1.2 Quantum Mechanics 

Quantum mechanics is a theory that caught its inventors by surprise. The main reason for this 
is that the theory is heavily empirical and pragmatic in flavour — the formalism was forced onto 
us by experimental evidence — and so contradicted the principle-based "natural philosophy" 
tradition which had reaped such success in physics. In the absence of over-arching principles we 
are left with a formalism, fantastically successful, which is mute on several important subjects. 

What is a quantum system? In the pragmatic spirit of the theory, a quantum system is one 
which cannot be described by classical mechanics. In general quantum effects become important 
when small energy differences become important, but in the absence of a priori principles we can 
give no strict definition. For example, NMR quantum computing can be described in entirely 
classical terms [11] and yet is advertised as the first demonstration of fully quantum computing; 
and Kitaev has conjectured that some types of quantum systems can be efficiently simulated 
by classical systems. The quantum-classical distinction is not crucial to this thesis, where we are 
dealing with part of the formal apparatus of the theory; indeed, there is some hope that from 
this apparatus can be coaxed some principles to rule the quantum world [13]. 



For our purposes, the following axioms serve to define quantum mechanics: 

1. States. A state of a quantum system is represented by a bounded linear operator ρ (called 
a density operator) on a Hilbert space ℋ satisfying the following conditions: 

• ρ is Hermitian. 

• The sum of the eigenvalues is one, Tr ρ = 1. 

• Every eigenvalue of ρ is nonnegative, which we denote by ρ ≥ 0. 

Vectors of the Hilbert space will be denoted by |ψ⟩ (Dirac's notation), and the inner product 
of two vectors is denoted ⟨ψ|φ⟩. In all cases considered in this thesis, the underlying Hilbert 
space will be of finite dimension. 

2. Measurement. Repeatable, or von Neumann, measurements are represented by complete 
sets of projectors Π_i onto orthogonal subspaces of ℋ. By complete we mean that Σ_i Π_i = 
1 (the identity operator), but the projectors are not required to be one-dimensional. Then 
the probability of outcome i is 

p(i) = Tr Π_i ρ. (1.45) 

After the measurement the new state is given by ρ'_i = Π_i ρ Π_i / Tr Π_i ρ if we know that 
outcome i was produced, and otherwise by ρ' = Σ_i Π_i ρ Π_i if we merely know a measurement 
was made. 

3. Evolution. When the system concerned is sufficiently isolated from the environment and 
no measurements are being made, evolution of the density operator is unitary: 

ρ → U ρ U†, (1.46) 

where U is unitary so that U^{-1} = U†. The dynamics of the system are governed by the 
Schrödinger equation. In this thesis we will not be concerned about the specific operator 
U relevant for a certain situation since we will be more concerned with evolutions that are 
in principle possible. However, we observe that for a given system the unitary operator is 
given by U = exp(-iHt/ħ), where H is the (time-independent) Hamiltonian of the system and t is the time. 

These are the basic tools of the mathematical formalism of quantum mechanics. In the following 
sections we will consider systems which are not fully isolated from their environment and in 
which measurements occur occasionally. Our aim is to complete our toolkit by discovering what 
evolutions are in principle possible for a real quantum system within the framework of the axioms 
above. 
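The three axioms can be exercised on a single qubit with nothing more than 2 × 2 matrix arithmetic. The sketch below is illustrative only: the state `rho`, the choice of projectors onto |+⟩ and |−⟩, the rotation angle `theta` and the helper names are all assumptions, and matrices are stored as plain lists of rows.

```python
from math import cos, sin

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Hermitian conjugate (conjugate transpose)."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Axiom 1: a hypothetical mixed qubit state rho = 3/4 |0><0| + 1/4 |1><1|.
rho = [[0.75 + 0j, 0j], [0j, 0.25 + 0j]]

# Axiom 2: a complete set of projectors, here onto |+> and |->.
plus = [[0.5 + 0j, 0.5 + 0j], [0.5 + 0j, 0.5 + 0j]]
minus = [[0.5 + 0j, -0.5 + 0j], [-0.5 + 0j, 0.5 + 0j]]
projectors = [plus, minus]
probs = [trace(matmul(P, rho)).real for P in projectors]  # Eqn 1.45

def post_measurement(P, state):
    """State after outcome P is observed: P rho P / Tr(P rho)."""
    unnormalised = matmul(matmul(P, state), P)
    norm = trace(unnormalised).real
    return [[x / norm for x in row] for row in unnormalised]

after_plus = post_measurement(plus, rho)

# Axiom 3: unitary evolution rho -> U rho U^dagger (Eqn 1.46), here a real rotation.
theta = 0.3
U = [[cos(theta) + 0j, -sin(theta) + 0j], [sin(theta) + 0j, cos(theta) + 0j]]
evolved = matmul(matmul(U, rho), dagger(U))

print("outcome probabilities:", probs)
print("state after outcome '+':", after_plus)
print("trace after evolution:", trace(evolved).real)
```

The evolved operator remains Hermitian with unit trace, as a density operator must.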



1.2.1 States and Subsystems 

A density operator ρ which is a one-dimensional projector is called a pure state. In this case 
ρ = |ψ⟩⟨ψ| for some vector |ψ⟩ in the Hilbert space ℋ, and we frequently call |ψ⟩ the state 
of the system. Pure states occupy a special place in quantum theory because they describe 
states of maximal knowledge: no further experiments on such a pure state will allow us to 
predict more of the system's behaviour. There is in fact a strong sense in which mixed states (to 
be discussed below) involve less knowledge than pure states: Wootters showed that if the 
average value of a dynamical variable of a system is known, the uncertainty in this observable 
decreases if we have the additional knowledge that the state is pure. 

Footnote (on the finite dimension of the Hilbert space): For the case of an infinite-dimensional Hilbert space, additional technical requirements regarding the 
completeness of the space and the range of the operator ρ must be imposed. 

Footnote (on projective measurements): A measurement is also frequently associated with a Hermitian operator E on the Hilbert space; correspondence 
with the current formalism is achieved if we associate with E the set of projectors onto its eigenspaces. 

The density operators form a convex set, which means that for any λ with 0 ≤ λ ≤ 1, a convex 
combination λρ_1 + (1 - λ)ρ_2 of two density operators ρ_1, ρ_2 is again a density operator. The pure 
states are also special in this context: they form the extreme points of this convex set, which 
means that there is no non-trivial way of writing a pure state as a convex combination of 
other density operators. Also, since a general density operator is Hermitian and positive, there 
exists a set of orthonormal vectors |ψ_i⟩ and real numbers λ_i such that 

ρ = \sum_{i=1}^{D} λ_i |ψ_i⟩⟨ψ_i| (1.47) 

where D is the dimension of the underlying Hilbert space; this result is the spectral theorem for 
Hermitian operators. The eigenvalues of ρ are then the λ_i, and they sum to one. For this reason 
mixed states are frequently regarded as classical probability mixtures of pure states. Indeed, if 
we consider the ensemble in which the pure state |ψ_i⟩ occurs with probability λ_i, and we make 
a measurement represented by the projectors {Π_μ}, then the probability of the outcome μ is 

p(μ) = \sum_{i=1}^{D} λ_i Tr( Π_μ |ψ_i⟩⟨ψ_i| ) = Tr\left( Π_μ \sum_{i=1}^{D} λ_i |ψ_i⟩⟨ψ_i| \right) = Tr Π_μ ρ, (1.48) 

where we have used the linearity of the trace to take the λ_i inside. 

Thus if a pure state results from maximal knowledge of a physical system, then a mixed state 
arises when our available knowledge of the system doesn't uniquely identify a pure state: we 
have less than maximal knowledge. Mixed states arise in two closely related ways: 

• We do not have as much knowledge as we could in principle, perhaps due to imperfections 
in our preparation apparatus, and must therefore characterise the state using a probability 
distribution over all pure states compatible with the knowledge available to us. 

• The system A is part of a larger system AB which is in a pure state. In this case our 
knowledge of the whole system is maximal, but correlations between subsystems don't 
permit us to make definite statements about subsystem A. 

From an experimenter's viewpoint these situations are indistinguishable. Consider the simplest 
canonical example of a quantum system, a two-level system which we will refer to as a qubit. 
We can arbitrarily label the pure states of the system |0⟩ and |1⟩, which could correspond to 
the two distinct states of the spin degree of freedom ('up' and 'down') in a spin-1/2 particle. 
The experimenter may produce these particles in an ionisation chamber, and the density matrix 
describing these randomly produced particles would be (1/2)|0⟩⟨0| + (1/2)|1⟩⟨1| = (1/2) 1. If he were 
extremely skilled, however, the experimenter could observe the interactions which produced each 
electron and write down an enormous pure state vector for the system. When he traced out the 
many degrees of freedom not associated with the electron to which his apparatus is insensitive, 
he would again be left with the state (1/2) 1. 

The "tracing out of degrees of freedom" is achieved in the formalism by performing a partial 
trace over the ignored system. Suppose we consider two subsystems A and B and let |1_a⟩ (where 
a = 1, ..., D_1) and |2_α⟩ (where α = 1, ..., D_2) be orthonormal bases over the two systems' 
Hilbert spaces respectively. Then if ρ^{AB} is a joint state of the system, the reduced density matrix 
ρ^A of system A is 

ρ^A = \sum_{α=1}^{D_2} ⟨2_α| ρ^{AB} |2_α⟩ ≡ Tr_B ρ^{AB} (1.49) 

with a similar expression holding for subsystem B. We frequently use the matrix elements 
ρ_{aα,bβ} = ⟨1_a|⟨2_α| ρ^{AB} |2_β⟩|1_b⟩ of ρ^{AB} to represent the density operator of a system; in this notation, 
the operation of partial trace looks like 

ρ^A_{ab} = \sum_{α} ρ_{aα,bα}. (1.50) 

As a converse to this, one can also consider the purifications of a mixed state. Suppose our 
state ρ is an operator on the Hilbert space ℋ_A with spectral decomposition ρ = Σ_i λ_i |ψ_i⟩⟨ψ_i|. 
Then consider the following vector from the space ℋ_A ⊗ ℋ_B (dim ℋ_A ≤ dim ℋ_B): 

|Ψ^{AB}⟩ = \sum_i \sqrt{λ_i} |B_i⟩ ⊗ |ψ_i⟩, (1.51) 

with the |B_i⟩ any orthonormal set in ℋ_B. Then ρ = Tr_B |Ψ^{AB}⟩⟨Ψ^{AB}|, so ρ can be considered 
to be one part of a bipartite pure state. There are of course an infinite number of alternative 
purifications of a given mixed state. 
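A purification can be written down explicitly for a small example. In the sketch below the mixed state is assumed (for illustration) to be diagonal, ρ = 0.7|0⟩⟨0| + 0.3|1⟩⟨1|, and the orthonormal set |B_i⟩ is taken to be the computational basis of a second qubit; the array `amplitudes[a][b]` holds the coefficient of the term with system-A index a and system-B index b. These choices and names are not from the text.

```python
from math import sqrt

# Hypothetical eigenvalues lambda_i of rho; its eigenvectors |psi_i> are taken
# to be the computational basis vectors of system A.
eigenvalues = [0.7, 0.3]
dim = len(eigenvalues)

# Purification sum_i sqrt(lambda_i) |psi_i>_A |B_i>_B with |B_i> = |i>:
# only the "diagonal" amplitudes are non-zero.
amplitudes = [[sqrt(eigenvalues[a]) if a == b else 0.0 for b in range(dim)]
              for a in range(dim)]

# The joint state is normalised.
norm_squared = sum(x * x for row in amplitudes for x in row)

# Partial trace over B: rho^A_{a a'} = sum_b c[a][b] * c[a'][b] (real amplitudes).
reduced = [[sum(amplitudes[a][b] * amplitudes[a2][b] for b in range(dim))
            for a2 in range(dim)] for a in range(dim)]

print("norm squared:", norm_squared)
print("reduced state of A:", reduced)
```

Replacing the set |B_i⟩ by any other orthonormal set in system B changes the joint vector but not the reduced state, which is the sense in which purifications are not unique.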

Example 1.6 Two spin-1/2 systems 

The simplest bipartite system is a system of two qubits. The sets {|0_A⟩, |1_A⟩} and {|0_B⟩, |1_B⟩} 
are bases for the two individual qubits' spaces. If we denote a vector |i_A⟩|j_B⟩ by |ij⟩, then a 
basis for the combined system is {|00⟩, |01⟩, |10⟩, |11⟩}. Consider the pure state of the combined 
system (the singlet state) 

|ψ^-⟩ = \frac{1}{\sqrt{2}} ( |10⟩ - |01⟩ ), (1.52) 

which has density matrix 

ρ^{AB} = |ψ^-⟩⟨ψ^-| = \frac{1}{2} ( |10⟩⟨10| + |01⟩⟨01| - |10⟩⟨01| - |01⟩⟨10| ); (1.53) 

tracing out system B removes the second index and retains only those terms where the second 
index of the bra and ket are the same. So the reduced density matrix of system A is 
ρ^A = (1/2)(|1⟩⟨1| + |0⟩⟨0|) = (1/2) 1. 

The singlet state is a state of maximum uncertainty in each subsystem. This is because 
of the high degree of entanglement between its two subsystems, and this state, along with 
the triplet states |ψ^+⟩ and |φ^±⟩, is a canonical example of all the marvel and mystery behind 
quantum mechanics. We will return to investigate some properties of this state in later chapters. 
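The partial trace in Example 1.6 can be carried out mechanically from Eqn 1.50. The following sketch assumes the basis ordering |00⟩, |01⟩, |10⟩, |11⟩ (so that the pair of indices (a, α) maps to position a·D_2 + α); the helper names `outer` and `partial_trace_b` are illustrative.

```python
from math import sqrt

def outer(v):
    """Density matrix |v><v| of a state vector given as a list of amplitudes."""
    return [[x * y.conjugate() for y in v] for x in v]

def partial_trace_b(rho_ab, dim_a, dim_b):
    """Reduced state of A: rho^A_{ab} = sum_alpha rho_{a alpha, b alpha} (Eqn 1.50).
    The basis vector |a, alpha> is stored at index a * dim_b + alpha."""
    return [[sum(rho_ab[a * dim_b + k][b * dim_b + k] for k in range(dim_b))
             for b in range(dim_a)] for a in range(dim_a)]

# Singlet state (|10> - |01>)/sqrt(2) in the basis |00>, |01>, |10>, |11>.
singlet = [0j, -1 / sqrt(2) + 0j, 1 / sqrt(2) + 0j, 0j]
rho_ab = outer(singlet)
rho_a = partial_trace_b(rho_ab, 2, 2)

# Purity Tr(rho^2): 1 for a pure state, 1/2 for the maximally mixed qubit state.
purity_ab = sum(rho_ab[i][j] * rho_ab[j][i] for i in range(4) for j in range(4)).real
purity_a = sum(rho_a[i][j] * rho_a[j][i] for i in range(2) for j in range(2)).real

print("reduced state of A:", rho_a)
print("purity of the joint state:", purity_ab, " purity of A:", purity_a)
```

The joint state is pure while the reduced state is maximally mixed, which is the "maximal knowledge of the whole, none of the part" situation described above.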

1.2.2 Generalised Measurements 

Von Neumann measurements turn out to be unnecessarily restrictive: the number of distinct 
outcomes is limited by the dimensionality of the system considered, and we can't answer simple 
questions that have useful classical formulations [p. 280]. Our remedy for this is to adopt a 
new measurement technique. Suppose we wish to make a measurement on a quantum state ρ^s 
with Hilbert space ℋ_s. We will carry out the following steps: 

• Attach an ancilla system in a known state ρ^a. The state of the combined system will be the 
product state ρ^s ⊗ ρ^a. Note that our ancilla Hilbert space could have as many dimensions 
as we require. 

• Evolve the combined system unitarily, ρ^s ⊗ ρ^a → U (ρ^s ⊗ ρ^a) U†. The resulting state is likely 
going to be entangled. 

• Make a von Neumann measurement on just the ancilla, represented by the projectors 1_s ⊗ 
Π_α, where Π_α acts on the ancilla. The probability of outcome α will then be 

p(α) = Tr[ (1 ⊗ Π_α) U (ρ^s ⊗ ρ^a) U† ]. (1.54) 

Of course we are not interested in the (hypothetical) ancilla system we have introduced, and so 
we can simplify our formalism by tracing it out earlier. If we write 

E_α = Tr_ancilla[ (1 ⊗ ρ^a) U† (1 ⊗ Π_α) U ] (1.55) 

then E_α is an operator over the system Hilbert space ℋ_s, and (using the cyclic property of the 
trace in Eqn 1.54) the probability of outcome α is 
given by the much simpler formula p(α) = Tr(E_α ρ^s), where the trace is over just the system 
degrees of freedom. Note that the final two steps can be amalgamated: unitary evolution followed 
by measurement is exactly the same as a measurement in a different basis. However in this case 
we will have to allow a von Neumann measurement on the combined system, not just the ancilla. 

The set of operators {E_α} is called a POVM, for Positive Operator-Valued Measure, and 
represents the most general type of measurement possible on a quantum system. Note that, 
in contrast with a von Neumann measurement, these operators are not necessarily orthogonal and at first 
glance there doesn't appear to be any simple way to represent the state just after a measurement 
has been made. We will return to this in the next section. 

The set {E_α} has the following characteristics: 

1. Σ_α E_α = 1. 

2. Each E_α is a Hermitian operator. 

3. All the eigenvalues of the E_α are non-negative, i.e. E_α ≥ 0. 

These are in some sense the minimum specifications required to extract probabilities from a 
density operator. The first requirement ensures that the probability of obtaining some outcome 
is one; the second ensures that the probabilities are real numbers, and the third guarantees that 
these real numbers are non-negative. It is therefore pleasing that any set of operators satisfying 
these 3 conditions can be realised using the technique described at the start of this section: this 
is the content of Kraus' Theorem. So if we define a POVM by the above three requirements, 
we are guaranteed that this measurement can be realised by a repeatable measurement on a 
larger system. And the above characterisation is a lot easier to work with! 
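A concrete POVM with more outcomes than the dimension of the system makes the three conditions tangible. The sketch below uses the three-outcome "trine" measurement on a qubit, a standard textbook example that is not taken from this thesis: E_k = (2/3)|φ_k⟩⟨φ_k| with |φ_k⟩ = (cos(kπ/3), sin(kπ/3)). The state `rho` and the helper names are likewise illustrative assumptions.

```python
from math import cos, pi, sin, sqrt

def trine_element(k):
    """E_k = (2/3) |phi_k><phi_k| with |phi_k> = (cos(k pi/3), sin(k pi/3))."""
    c, s = cos(k * pi / 3), sin(k * pi / 3)
    return [[2 / 3 * c * c, 2 / 3 * c * s],
            [2 / 3 * c * s, 2 / 3 * s * s]]

def eigenvalues_2x2_symmetric(m):
    """Both eigenvalues of a real symmetric 2x2 matrix."""
    mean = (m[0][0] + m[1][1]) / 2
    radius = sqrt(((m[0][0] - m[1][1]) / 2) ** 2 + m[0][1] ** 2)
    return mean - radius, mean + radius

povm = [trine_element(k) for k in range(3)]

# Condition 1: the elements sum to the identity.
total = [[sum(E[i][j] for E in povm) for j in range(2)] for i in range(2)]

# Conditions 2 and 3: each element is symmetric (Hermitian) with eigenvalues >= 0.
spectra = [eigenvalues_2x2_symmetric(E) for E in povm]

# Outcome probabilities p(k) = Tr(E_k rho) for a hypothetical state.
rho = [[0.75, 0.0], [0.0, 0.25]]
probs = [sum(E[i][j] * rho[j][i] for i in range(2) for j in range(2)) for E in povm]

print("sum of elements:", total)
print("eigenvalues:", spectra)
print("probabilities:", probs)
```

Three outcomes are obtained from a two-dimensional system, which no von Neumann measurement on the qubit alone could provide.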

1.2.3 Evolution 

What quantum state are we left with after we have performed a POVM on a given quantum 
state? We will first look at this question for the case where the system starts in a pure state, 
ρ^s = |ψ^s⟩⟨ψ^s|, and the ancilla is in a mixed state ρ^a = Σ_i μ_i |φ_i^a⟩⟨φ_i^a|. Then according to the 
measurement axiom presented previously, the final state of the combined system will be 

ρ'^{sa}_α = \frac{1}{Tr(E_α ρ^s)} \sum_i μ_i U P_α ( |ψ^s⟩ ⊗ |φ_i^a⟩ )( ⟨ψ^s| ⊗ ⟨φ_i^a| ) P_α U† (1.56) 

where P_α = U†(1 ⊗ Π_α)U represents a projection onto some combined basis of the system and 
ancilla. Now we define a set of operators on the system, 

M_{bα} = \sqrt{μ_i} ⟨ξ_k| U P_α |φ_i^a⟩, \quad b = (i, k), (1.57) 

where {|ξ_k⟩} is an arbitrary orthonormal basis for the ancilla Hilbert space and E_α is the POVM 
element corresponding to outcome α. Then the state of the system after measurement can be expressed in 
terms of these operators: 

ρ'^{s}_α = \frac{1}{Tr(E_α ρ^s)} \sum_b M_{bα} |ψ^s⟩⟨ψ^s| M_{bα}^† (1.58) 

where we have amalgamated the indices (i, k) into one index b. This operation is linear so that 
for any mixed state ρ^s, the state after measurement is 

ρ'^{s}_α = \frac{1}{Tr(E_α ρ^s)} \sum_b M_{bα} ρ^s M_{bα}^† (1.59) 

if we know that the outcome is α, and 

ρ'^{s} = \sum_{b,α} M_{bα} ρ^s M_{bα}^† (1.60) 

if we only know that a measurement has been made. Note that we can choose different orthonormal 
bases |ξ_k⟩ for each value of α to make the representation as simple as possible, and that in 
general, the state resulting from a POVM depends on exactly how the POVM was implemented. 
The operators M_{bα} are not free, however; they must satisfy 

\sum_b M_{bα}^† M_{bα} = \sum_{i,k} μ_i ⟨φ_i^a| P_α U† |ξ_k⟩⟨ξ_k| U P_α |φ_i^a⟩ 
= \sum_i μ_i ⟨φ_i^a| P_α |φ_i^a⟩ 
= Tr_ancilla[ (1 ⊗ ρ^a) U† (1 ⊗ Π_α) U ] 
= E_α (1.61) 

(from Eqn 1.55). We will call a generalised measurement efficient if for each value of α, there 
is only one value of b in the sum in Eqn 1.59; such a measurement is called efficient because 
pure states are mapped to pure states, so if we have maximal knowledge of the system before 
the measurement this is not ruined by our actions. 

Footnote (on the mixed ancilla state): We could, without loss of generality, assume the ancilla starts in a pure state. 

Suppose now we have an arbitrary set of operators N_μ satisfying (as do the M_{bα} above) 
Σ_μ N_μ^† N_μ = 1. Then the mapping defined by 

ρ → $(ρ) = \sum_μ N_μ ρ N_μ^† (1.62) 

is called an operator-sum and has the following convenient properties: 

1. If Tr ρ = 1 then Tr $(ρ) = 1 ($ is trace-preserving). 

2. $ is linear on the space of bounded linear operators on a given Hilbert space. 

3. If ρ is Hermitian then so is $(ρ). 

4. If ρ ≥ 0 then $(ρ) ≥ 0 (we say $ is positive). 

If we add one more condition to this list, then this list defines an object called a superoperator. 
This extra condition is 

5. Let 1_n be the identity on the n-dimensional Hilbert space. We require that the mapping 
T_n = $ ⊗ 1_n be positive for all n. This is called complete positivity. 

Physically, this means that if we include for consideration any extra system X — perhaps some 
part of the environment — so that the combined system is possibly entangled, but the system X 
evolves trivially (it doesn't evolve), the resulting state should still be a valid density operator. 
It turns out that an operator-sum does satisfy this requirement and so any operator-sum is also 
a superoperator. 

Superoperators occupy a special place in the hearts of quantum mechanicians precisely because 
they represent the most general possible mappings of valid density operators to valid density 
operators, and thus the most general allowed evolution of physical states. It is thus particularly 
pleasing that we have the following theorems (proved in [16]): 

1. Every superoperator has an operator-sum representation. 

2. Every superoperator can be physically realised as a unitary evolution of a larger system. 

The first theorem is a technical one, giving us a concrete mathematical representation for the 
"general evolution" of an operator. The second theorem has physical content, and tells us that 
what we have plucked out of the air to be "the most general possible evolution" is consistent 
with the physical axioms presented previously. 
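An operator-sum is easy to experiment with numerically. The sketch below applies Eqn 1.62 using the amplitude-damping operators, a standard textbook choice that is not discussed in this thesis; the damping probability `gamma`, the input state and the helper names are illustrative assumptions. It checks the completeness condition and properties 1, 3 and 4 for one input state; it does not test complete positivity (property 5).

```python
from math import sqrt

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Hermitian conjugate (conjugate transpose)."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def apply_operator_sum(operators, rho):
    """$(rho) = sum_mu N_mu rho N_mu^dagger (Eqn 1.62)."""
    out = [[0j] * len(rho) for _ in rho]
    for n in operators:
        out = add(out, matmul(matmul(n, rho), dagger(n)))
    return out

gamma = 0.3  # hypothetical damping probability
operators = [
    [[1 + 0j, 0j], [0j, sqrt(1 - gamma) + 0j]],  # N_0: no decay
    [[0j, sqrt(gamma) + 0j], [0j, 0j]],          # N_1: |1> decays to |0>
]

# Completeness: sum_mu N_mu^dagger N_mu should be the identity.
completeness = [[0j, 0j], [0j, 0j]]
for n in operators:
    completeness = add(completeness, matmul(dagger(n), n))

rho = [[0.5 + 0j, 0.5 + 0j], [0.5 + 0j, 0.5 + 0j]]  # pure input state |+><+|
out = apply_operator_sum(operators, rho)

# For a 2x2 Hermitian matrix with unit trace, a non-negative determinant
# means both eigenvalues are non-negative.
determinant = (out[0][0] * out[1][1] - out[0][1] * out[1][0]).real

print("completeness:", completeness)
print("output state:", out)
print("trace:", trace(out).real, " determinant:", determinant)
```

The strictly positive determinant shows that the pure input has been mapped to a mixed output, as expected when two operators contribute to the sum.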



CHAPTER 2 
Information in Quantum Systems 



At the start of the previous chapter, we identified information as an abstract quantity, useful 
in situations of uncertainty, which was in a particular sense independent of the symbols used 
to represent it. In one example we had the option of using symbols Y and N or integers 
{0, 1, . . . ,365}. But we in fact have even more freedom than this: we could communicate an 
integer by sending a bowl with n nuts in it or as an n volt signal on a wire; we could relay Y or 
N as letters on a piece of paper or by sending Yevgeny instead of Nigel. 

Physicists are accustomed to finding, and exploiting, such invariances in nature. The term 
"Energy" represents what we have to give to water to heat it, or what a rollercoaster has at 
the top of a hump, or what an exotic particle possesses as it is ejected from a nuclear reaction. 
Information thus seems like a prime candidate for the attention of physicists, and the sort of 
questions we might ask are "Is information conserved in interactions?", "What restrictions are 
there to the amount of information we can put into a physical system, and how much can we 
take out of one?" or "Does information have links to any other useful concepts in physics, like 
energy?". This chapter addresses some of these questions. 

The physical theory which we will use to investigate them is of course quantum mechanics. But 
— and this is another reason for studying quantum information theory — such considerations 
are also giving us a new perspective on quantum theory, which may in time lead to a set of 
principles from which this theory can be derived. 

One of the major new perspectives presented by quantum information theory is the idea of 
analysing quantum theory from within. Often such analyses take the form of algorithms or cyclic 
processes, or in some circumstances even games [^]; in short, the situations considered are finite 
and completely specified. The aims of such investigations are generally: 

• To discover situations (games, communication problems, computations) in which a sys- 
tem behaving quantum mechanically yields qualitatively different results from any similar 
system described classically. 

• To investigate extremal cases of such "quantum violation" and perhaps deduce fundamental 
limits on information storage, transmission or retrieval. Such extremal principles could also 
be useful in uniquely characterising quantum mechanics. 

• To identify and quantify the quantum resources (such as entanglement or superposition) 
required to achieve such qualitative differences, and investigate the general properties of 
these resources. 

In many of these investigations, the co-operating parties Alice and Bob (and sometimes their 
conniving acquaintance Eve) are invoked, and this thesis will be no exception. In this 
chapter we consider a message to be sent from Alice to Bob, encoded in the quantum state of 
a system. For the moment we will be assuming that the quantum state passes unchanged (and 
in particular is not subject to degradation or environmental interaction) between them. 



2.1 Physical Implementation of Communication 

In classical information theory, the bit is a primitive notion. It is an abstract unit of 
information corresponding to two alternatives, which are conveniently represented as 0 and 1. 
Some of the characteristics of information encoded in bits are that it can be freely observed 
and copied (without changing it), and that after observing it one has exactly the right amount 
of information to prepare a new instance of the same information. Also, in order to specify one 
among $N$ alternatives we require $\log N$ bits. 

Figure 2.1: Alice and Bob using a physical system to convey information 

Footnote: The contrasting of preparation, missing and accessible information presented here is 
based on [14]. 

The quantum version of a two-state system has somewhat different properties. Typical 
examples of two-state systems are spin-1/2 systems, where the 'spin up' state may be written 
$|\uparrow\rangle$ and 'spin down' $|\downarrow\rangle$, or the polarisation states of photons, 
which may be written $|\updownarrow\rangle$ for vertical polarisation and $|\leftrightarrow\rangle$ 
for horizontal. For convenience we will label these states $|0\rangle$ and $|1\rangle$, using 
the freedom of representation discussed previously. A new feature arises immediately in that the 
quantum two-state system (or qubit) can exist in the superposition state $|0\rangle + |1\rangle$ 
which, if measured in the conventional basis $\{|0\rangle, |1\rangle\}$, will yield a completely 
random outcome. 



We will analyse the communication setup represented in Figure 2.1. Alice prepares a state 
of the physical system A and sends the system to Bob who is free to choose the measurement 
he makes on A. Bob is aware of the possible preparations available to Alice, but of course is 
uncertain of which one in particular was used. We will identify three different types of information 
in physical systems (preparation, missing and accessible information) and briefly contrast the 
quantum information content with the information of an equivalent classically described system. 



2.1.1 Preparation Information 

Let us suppose for a moment that Alice and Bob use only pure states of a quantum system to 
communicate (mixed states will be discussed later). To be more specific, suppose they are using 
a two-level system and the $k$-th message state is represented by 

$$|\psi_k\rangle = a_k|0\rangle + b_k|1\rangle \tag{2.1}$$

where $a_k$ and $b_k$ are complex amplitudes. We can choose the overall phase so that 
$a_k \in \mathbb{R}$, in which case $|\psi_k\rangle$ is specified by two real numbers ($a_k$ and 
the phase of $b_k$). How much information is there in a real number? An infinite amount, 
unfortunately. We overcome this problem, as is frequently done in statistical physics, by 
coarse-graining: in this case by giving the real number to a fixed number of decimal places. In 
practice, Alice's preparation equipment will in any case have some finite resolution, so she can 
only distinguish between a finite number $\mathcal{N}$ of signals. So to prepare a signal state 
of this qubit Alice must specify $\log\mathcal{N}$ bits of information, and if she wants Bob 
to have complete knowledge of which among the signal states was prepared, she must send him 
exactly $\log\mathcal{N}$ classical bits. 

Footnote: The word "qubit" was coined by Schumacher. 

If Alice and Bob are using an $N$-level quantum system, the state of the system is represented 
by a ray in a projective Hilbert space $\mathcal{H}$. A natural metric on $\mathcal{H}$, given by 
Wootters [^], is 

$$d(\psi_0, \psi_1) = \cos^{-1}|\langle\psi_0|\psi_1\rangle| \tag{2.2}$$

where the $|\psi_i\rangle$ are normalised representatives of their equivalence classes. With this 
metric, the compact space $\mathcal{H}$ can be partitioned into a finite number $\mathcal{N}$ of 
cells each with volume less than a fixed resolution $dV$, and these cells can be indexed by 
integers $j = 1, 2, \ldots, \mathcal{N}$. By making $dV$ small enough, any ray in cell $j$ can be 
arbitrarily well approximated by a fixed pure state $|\psi_j\rangle$ which is inside cell $j$. 
Thus our signal states are the finite set $\{|\psi_1\rangle, \ldots, |\psi_{\mathcal{N}}\rangle\}$, 
and they occur with some pre-agreed probabilities $p_1, \ldots, p_{\mathcal{N}}$. 

At this point we can compare the quantum realisation of signal states with an equivalent 
classical situation. In the classical case, the system will be described by canonical co-ordinates 
$P_1, \ldots, P_F, Q_1, \ldots, Q_F$ in a phase space. We can select some compact subset $W$ to 
represent practically realisable states, and partition $W$ into a finite number $\mathcal{N}$ of 
cells of phase space volume smaller than $dV_c$. A typical state of the system corresponds to a 
point in the phase space, but on this level of description we associate any state in cell $j$ 
with some fixed point in that cell. 

By "preparation information" we mean the amount of information, once we know the details of 
the coarse-graining discussed above, required to unambiguously specify one of the cells. From the 
discussion of information theory presented in Chapter 1, we know that the classical information 
required, on average, to describe a preparation is 

$$H(p) = -\sum_j p(j)\log p(j) \tag{2.3}$$

which is bounded above by $\log\mathcal{N}$. Thus by making our resolution volume ($dV$ or 
$dV_c$) small enough, we can make the preparation information of the system as large as we want. 
How small can we make it? 
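
As a minimal sketch of Eqn 2.3, the following computes the preparation information in bits for two illustrative distributions over the coarse-grained cells. The number of cells and the probabilities are hypothetical values chosen only for illustration; the uniform distribution attains the upper bound $\log\mathcal{N}$.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

num_cells = 8  # hypothetical number of coarse-grained cells
uniform = [1 / num_cells] * num_cells
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.01, 0.01]  # hypothetical p(j), sums to 1

h_uniform = shannon_entropy(uniform)  # attains the bound log2(num_cells)
h_skewed = shannon_entropy(skewed)    # strictly below the bound
print(h_uniform, h_skewed, math.log2(num_cells))
```

Refining the partition (a larger `num_cells`) raises the bound, which is the sense in which the preparation information can be made as large as we like.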



Von Neumann Entropy and Missing Information. In classical statistical mechanics, there 
is a state function which is identified as "missing information". Given some incomplete 
information about the state of a complicated system (perhaps we know its temperature $T$, 
pressure $P$ and volume $V$), the thermodynamic entropy is defined, up to an additive constant, 
to be 

$$S(T, V, P) = k\log W(T, V, P) \tag{2.4}$$

where $W$ is the statistical weight of states which have the prescribed values of $T$, $V$ and 
$P$; this is the Boltzmann definition of entropy. The idea is very similar to that expressed in 
Chapter 1: the function $S(T, V, P)$ is, loosely speaking, the number of extra bits of 
information required, once we know $T$, $V$ and $P$, in order to determine the exact state of the 
system. According to the Bayesian view, these macroscopic variables are background information 
which allow us to assign a prior probability distribution; our communication setup is completely 
equivalent to this except for the fact that we prescribe the probability distribution ourselves. 
So it is reasonable to calculate a function similar to Eqn 2.4 for our probability distribution 
and call it "missing information". 

Footnote: An $N$-dimensional projective Hilbert space is the set of equivalence classes of 
elements of some $N$-dimensional Hilbert space, where two vectors are regarded as equivalent if 
they are complex multiples of each other. 

Footnote: There are many ways of approaching entropy; see [^] and [21]. 

Consider an ensemble of $\nu \gg 1$ classical systems as described above. Since $\nu$ is large, 
the number $\nu_j$ of systems in this ensemble which occupy cell $j$ is approximately $p(j)\nu$. 
The statistical weight is then the number of different ensembles of $\nu$ systems which have the 
same number of systems in each cell [23]: 

$$W = \frac{\nu!}{\prod_j \nu_j!}. \tag{2.5}$$

The information missing towards a "microstate" description of the classical state is the average 
information, per system, required to describe the ensemble: 

$$S(p) = \frac{k}{\nu}\log W \approx -k\sum_j p(j)\log p(j) \tag{2.6}$$

where we have used Stirling's approximation. Thus, up to a factor which expresses entropy in 
thermodynamic units, the information missing towards a microstate description of a classical 
system is equal to the preparation information. 
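
The Stirling step can be checked numerically. The sketch below sets $k = 1$, works in bits, and uses hypothetical cell probabilities; it compares the per-system logarithm of the multinomial weight with the Shannon entropy as the ensemble size grows. The helper names are illustrative.

```python
import math

def log2_multinomial(counts):
    """log2 of nu! / prod(nu_j!), via lgamma to avoid huge integers."""
    nu = sum(counts)
    ln_w = math.lgamma(nu + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_w / math.log(2)

def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.3, 0.2]  # hypothetical cell probabilities p(j)
for nu in (100, 10_000, 1_000_000):
    counts = [round(p * nu) for p in probs]      # nu_j is approximately p(j) * nu
    per_system = log2_multinomial(counts) / nu   # (1/nu) log2 W
    print(nu, per_system, shannon_entropy(probs))
```

The per-system value approaches the entropy from below, with a correction that shrinks roughly like $(\log\nu)/\nu$.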

The von Neumann entropy of a quantum state $\rho$ is defined to be 

$$S(\rho) = -\mathrm{Tr}(\rho\log\rho). \tag{2.7}$$

While this definition shares many properties with classical thermodynamic entropy, the field of 
quantum information theory exists largely because of the differences between these functions. 
However, there are still strong reasons for associating von Neumann entropy with missing 
information. Firstly, if the ensemble used for communication is the eigenensemble of $\rho$ (i.e. 
the eigenvectors of $\rho$ appearing with probabilities equal to the eigenvalues $\lambda_i$) then 
an orthogonal measurement in the eigenbasis of $\rho$ will tell us exactly which signal was sent. 
Secondly, if $S(\rho) > 0$ then any measurement (orthogonal or a POVM) whose outcome probabilities 
$p_b = \mathrm{Tr}(\rho E_b)$ yield less information than $S(\rho)$ cannot leave the system in a 
pure quantum state [14]. Intuitively, we start off missing $S(\rho)$ bits of information to a 
maximal description of the system, and discovering less information than this is insufficient to 
place the system in a pure state. 

Some properties of the von Neumann entropy are [22], [21]: 



1. Pure states are the unique zeros of the entropy. The unique maxima are those states 
proportional to the unit matrix, and the maximum value is log D where D is the dimension 
of the Hilbert space. 

2. If $U$ is a unitary transformation, then $S(\rho) = S(U\rho U^\dagger)$. 

3. $S$ is concave, i.e. if $\lambda_1, \ldots, \lambda_n$ are positive numbers whose sum is 1, 
and $\rho_1, \ldots, \rho_n$ are density operators, then 

$$S\Bigl(\sum_{i=1}^{n}\lambda_i\rho_i\Bigr) \ge \sum_{i=1}^{n}\lambda_i S(\rho_i). \tag{2.8}$$

Intuitively, this means that our missing knowledge must increase if we throw in additional 
uncertainties, represented by the $\lambda_i$'s. 



4. $S$ is additive, so that if $\rho_A$ is a state of system A and $\rho_B$ a state of system B, 
then $S(\rho_A\otimes\rho_B) = S(\rho_A) + S(\rho_B)$. If $\rho_{AB}$ is some (possibly 
entangled) state of the systems, and $\rho_A$ and $\rho_B$ are the reduced density matrices of 
the two subsystems, then 

$$S(\rho_{AB}) \le S(\rho_A) + S(\rho_B). \tag{2.9}$$

This property is known as subadditivity, and means that there can be more predictability 
in the whole than in the sum of the parts. 

5. Von Neumann entropy is strongly subadditive, which means that for a tripartite system in 
the state $\rho_{ABC}$: 

$$S(\rho_{ABC}) + S(\rho_B) \le S(\rho_{AB}) + S(\rho_{BC}). \tag{2.10}$$

This technical property, which is difficult to prove (see [^]), reduces to subadditivity in 
the case where system B is one-dimensional. 
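
A short worked example of subadditivity, added here as an illustration. It uses a maximally entangled two-qubit state, with logarithms taken to base 2; the state $|\Phi^+\rangle$ is an assumed example, not one taken from the text.

```latex
\begin{align*}
\rho_{AB} &= |\Phi^+\rangle\langle\Phi^+|, \qquad
  |\Phi^+\rangle = \tfrac{1}{\sqrt{2}}\bigl(|0\rangle_A|0\rangle_B + |1\rangle_A|1\rangle_B\bigr),\\
\rho_A &= \mathrm{Tr}_B\,\rho_{AB} = \tfrac{1}{2}\mathbb{1}, \qquad
  \rho_B = \mathrm{Tr}_A\,\rho_{AB} = \tfrac{1}{2}\mathbb{1},\\
S(\rho_{AB}) &= 0 \;<\; S(\rho_A) + S(\rho_B) = 1 + 1 = 2 \text{ bits}.
\end{align*}
```

The joint state is pure and so has no missing information, while each subsystem on its own is maximally mixed.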

So what is the relationship between preparation information and missing information in the 
quantum mechanical version of the communication system? Well, given the background information 
discussed above (i.e. knowledge of the coarse-grained cells and their a priori probabilities) 
but no preparation information (we don't know which coarse-grained cell was chosen), the density 
operator we assign to the system is $\rho = \sum_j p(j)|\psi_j\rangle\langle\psi_j|$, so the 
missing information is $S(\rho)$. Now it can be shown [^] that 

$$S(\rho) = -\sum_{k=1}^{D}\lambda_k\log\lambda_k \le -\sum_{j=1}^{\mathcal{N}} p(j)\log p(j) = I_p. \tag{2.11}$$

In this expression, the $\lambda_k$ are the eigenvalues of $\rho$, $D$ is the dimension of 
$\rho$, and $I_p$ is the preparation information for the ensemble. Equality holds if and only if 
the ensemble is the eigenensemble of $\rho$, that is, if all the signal states are orthogonal. 
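
To illustrate Eqn 2.11, the sketch below treats the simplest case of two pure signal states, for which the eigenvalues of $\rho$ have a closed form depending only on the prior and the overlap $|\langle\psi_0|\psi_1\rangle|$. The function names, the prior 0.3 and the overlaps are hypothetical choices for illustration.

```python
import math

def binary_entropy(x):
    """h(x) = -x log2 x - (1 - x) log2 (1 - x), with h(0) = h(1) = 0."""
    return -sum(t * math.log2(t) for t in (x, 1 - x) if t > 0)

def two_state_entropies(p0, overlap):
    """Return (S(rho), I_p) for rho = p0 |psi0><psi0| + (1 - p0) |psi1><psi1|,
    where overlap = |<psi0|psi1>|."""
    p1 = 1 - p0
    # The nonzero eigenvalues are (1 +/- disc) / 2, since the determinant of rho
    # restricted to the span of the two states is p0 * p1 * (1 - overlap^2).
    disc = math.sqrt(max(0.0, 1 - 4 * p0 * p1 * (1 - overlap ** 2)))
    return binary_entropy((1 + disc) / 2), binary_entropy(p0)

for overlap in (0.0, 0.5, 0.9, 1.0):
    s_rho, i_p = two_state_entropies(0.3, overlap)
    print(overlap, s_rho, i_p)
```

Only the orthogonal case (`overlap = 0.0`) gives equality; as the states become more parallel, $S(\rho)$ falls towards zero while $I_p$ stays fixed.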

The conclusion is that preparation information $I_p$ and missing information $S$ are equal in 
classical physics, but in quantum physics $S \le I_p$, and the preparation information can be made 
arbitrarily large. One question still remains: How much information can be extracted from the 
quantum state that Alice has prepared? 



2.1.2 Accessible Information 

Alice and Bob share the "background" information concerning the partitioning of the projective 
Hilbert space and the probability distribution of signals $p(j)$. Bob, knowing this information 
and no more, will ascribe a mixed state $\rho = \sum_i p(i)|\psi_i\rangle\langle\psi_i|$ to the 
system he receives. His duty in the communication scenario is to determine as well as he can 
which pure state he received. Of course, if he has very precise measuring equipment which enables 
him to perform von Neumann (orthogonal) measurements, he can say immediately after the 
measurement that the system is in a pure state: if the measurement outcome was '3', he knows the 
system is now exactly in state $|3\rangle$. But this information is not terribly useful for 
communication. 

Unfortunately for Bob, exact recognition of unknown quantum states is not possible. This is 
related to the fact that there is no universal quantum cloning machine [35]; that is, there 
doesn't exist a superoperator $\$$ which acts on an arbitrary state $|\psi\rangle$ and an ancilla 
in standard state $|s\rangle$ as 

$$\$ : |\psi\rangle|s\rangle \mapsto |\psi\rangle|\psi\rangle. \tag{2.12}$$

Cloning violates either the unitarity or the linearity of quantum mechanics (see Section 2.2), 
and could be used for instantaneous signalling [^6] or discrimination between non-orthogonal 
quantum states. 

Footnote: Note that the density operator encapsulates the best possible predictions we can make 
of any measurement of the system; this is why we employ it. 

The only option open to Bob, once he receives the quantum system from Alice, is to perform 
a POVM on the system. Suppose Bob chooses to perform the POVM $\{E_b\}$; the probability of 
outcome $b$ is $\mathrm{Tr}(\rho E_b)$. His task is to infer which state was sent, and so he uses 
Bayes' Theorem (from Chapter 1): 

$$p(P_j|O_b) = \frac{p(O_b|P_j)\,p(P_j)}{p(O_b)} \tag{2.13}$$

where $O_b$ represents the event "The outcome is $b$" and $P_j$ the event "The preparation was in 
cell $j$". We can calculate all the quantities on the right side of this expression: 
$p(O_b) = \mathrm{Tr}\,\rho E_b$, 
$p(O_b|P_j) = \mathrm{Tr}\,|\psi_j\rangle\langle\psi_j|E_b = \langle\psi_j|E_b|\psi_j\rangle$ 
and $p(P_j) = p(j)$. Thus we can calculate the post-measurement entropy 

$$H(P|O) = -\sum_b p(O_b)\sum_j p(P_j|O_b)\log p(P_j|O_b). \tag{2.14}$$

We can now define the information gain due to the measurement $\{E_b\}$ as 

$$I(\{E_b\}) = I_p - H(P|O) \tag{2.15}$$

and the accessible information [^] is defined as 

$$J = \max_{\{E_b\}} I(\{E_b\}). \tag{2.16}$$

The accessible information is the crucial characteristic of a physical communication system. 
In classical physics, states in different phase space cells are perfectly distinguishable as long as 
Bob's measuring apparatus has fine enough resolution; hence the accessible information is in 
principle equal to the preparation information and the missing information. What can be said 
in the quantum situation? 
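
As a rough numerical sketch of Eqns 2.13 to 2.16, the code below takes an assumed two-state ensemble (linearly polarised photons at 0 and 45 degrees, equally likely) and evaluates the information gain for a family of orthogonal measurements along polarisation angle `theta`. Scanning this restricted family gives only the best gain within it, not a computation of the accessible information over all POVMs. The ensemble and the half-degree grid are illustrative choices.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(priors, angles, theta):
    """I = I_p - H(P|O) for linear-polarisation states at the given angles,
    measured projectively along theta and theta + 90 degrees (radians)."""
    # p(O_b | P_j) from the Born rule for real (linearly polarised) states.
    likelihood = [[math.cos(theta - a) ** 2, math.sin(theta - a) ** 2] for a in angles]
    h_post = 0.0
    for b in range(2):
        p_b = sum(pj * likelihood[j][b] for j, pj in enumerate(priors))
        if p_b > 0:
            # Bayes' theorem gives the posterior p(P_j | O_b).
            posterior = [pj * likelihood[j][b] / p_b for j, pj in enumerate(priors)]
            h_post += p_b * entropy(posterior)
    return entropy(priors) - h_post

priors = [0.5, 0.5]
angles = [0.0, math.pi / 4]                            # horizontal and diagonal
thetas = [math.radians(0.5 * s) for s in range(360)]   # scan 0 to 179.5 degrees
best_gain, best_theta = max((information_gain(priors, angles, t), t) for t in thetas)
print(best_gain, math.degrees(best_theta))
```

For this ensemble the best gain found is about 0.40 bits, well below the one bit of preparation information.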

The accessible information is unfortunately very difficult to calculate, as is the measurement 
which realises the maximum in Eqn 2.16. Davies [25] has shown that the optimal measurement 
consists of unnormalised projectors $E_1, \ldots, E_N$, where the number of such projectors is 
bounded by $D \le N \le D^2$, where $D$ is the Hilbert space dimension. Several upper and lower 
bounds have been derived [26], some of which yield measurements that attain the bound. The most 
well-known bound is the Holevo Bound, first proved by Holevo in 1973 [27]. The bound is 

$$J \le S(\rho); \tag{2.17}$$

accessible information is always less than or equal to missing information. This is in fact the 
tightest bound which depends only on the density operator $\rho$ (and not on the specific 
ensemble), since the eigenensemble, measured in the eigenbasis, realises this bound, 
$J = S(\rho)$. For a general ensemble the bound is not tight, and in many situations there is a 
significant difference between $J$ and $S(\rho)$ [28], [29]. The Holevo Bound will be proved 
below (in more generality, when we discuss mixed state messages) and a physical interpretation 
will be discussed later in Section 2.4.1. 

Footnote: Davies also showed how to calculate or constrain the measurement if the ensemble 
exhibits symmetry. 

2.1.3 Generalisation to Mixed States 

We have found that the accessible information J and the preparation information Ip for an 
ensemble of pure states obey 

$$0 \le J \le S(\rho) \le I_p. \tag{2.18}$$

What can be said in the situation when the states comprising the ensemble are mixed states? 

Consider the ensemble of states $\rho_i$ occurring with probabilities $p(i)$, and suppose that 
$\rho_i = \sum_k \lambda^i_k|\phi^i_k\rangle\langle\phi^i_k|$ is the spectral decomposition of 
each signal state. We could then substitute the pure state ensemble $|\phi^i_k\rangle$ occurring 
with probabilities $p(i)\lambda^i_k$; then 

\begin{align}
S(\rho) &= S\Bigl(\sum_i p(i)\rho_i\Bigr) = S\Bigl(\sum_{i,k} p(i)\lambda^i_k|\phi^i_k\rangle\langle\phi^i_k|\Bigr) \notag\\
&\le -\sum_{i,k} p(i)\lambda^i_k\log\bigl(p(i)\lambda^i_k\bigr) \tag{2.19}\\
&= -\sum_i p(i)\log p(i) - \sum_i p(i)\sum_k \lambda^i_k\log\lambda^i_k = I_p + \sum_i p(i)S(\rho_i) \tag{2.20}
\end{align}

where Eqn 2.19 follows from the pure state result, Eqn 2.11. We thus conclude that the 
preparation information is bounded below by 

$$I_p \ge S(\rho) - \sum_i p(i)S(\rho_i) \equiv \chi(\mathcal{E}) \tag{2.21}$$

where $\mathcal{E} = \{\rho_i, p(i)\}$ denotes the ensemble of signal states. The function 
$\chi(\mathcal{E})$ is known by various names, including Holevo information and entropy defect. 
It shares many properties with von Neumann entropy, and reduces to $S$ in the case of a pure 
state ensemble. We can compare the Holevo information with the definition of mutual information, 
Eqn 1.26, 

$$I(X : Y) = H(X) - H(X|Y),$$

and we see that Holevo information quantifies the reduction in "missing information" on learning 
which state (among the $\rho_i$) was received. 
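
For qubits the Holevo information is easy to evaluate, because the entropy of a state depends only on the length of its Bloch vector. The sketch below uses an assumed ensemble of two equally likely states with opposite Bloch vectors of length `r`, and shows $\chi$ shrinking from one bit (pure, orthogonal signals) to zero (both signals maximally mixed). The function names and ensemble are illustrative.

```python
import math

def qubit_entropy(bloch):
    """Von Neumann entropy (bits) of rho = (1 + r.sigma)/2 from its Bloch vector r."""
    r = min(1.0, math.sqrt(sum(c * c for c in bloch)))
    return -sum(t * math.log2(t) for t in ((1 + r) / 2, (1 - r) / 2) if t > 0)

def holevo_chi(probs, blochs):
    """chi = S(sum_i p_i rho_i) - sum_i p_i S(rho_i) for qubit states."""
    # The Bloch vector of a mixture is the weighted mean of the Bloch vectors.
    mean = [sum(p * b[k] for p, b in zip(probs, blochs)) for k in range(3)]
    return qubit_entropy(mean) - sum(p * qubit_entropy(b) for p, b in zip(probs, blochs))

probs = [0.5, 0.5]
for r in (1.0, 0.8, 0.4, 0.0):
    blochs = [(0.0, 0.0, r), (0.0, 0.0, -r)]  # two signals, mixed towards the centre
    print(r, holevo_chi(probs, blochs))
```

In every case $\chi$ stays between zero and the preparation information of one bit, consistent with Eqn 2.21.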

The Holevo Bound. For pure states, the accessible information is bounded above by the von 
Neumann entropy; it turns out that, as with preparation information, the generalisation to mixed 
states is achieved by substituting the Holevo information for $S$. This more general Holevo bound 
can be proved fairly easily once the property of strong subadditivity of $S$ has been shown [22]. 
We will assume this property. 

Alice is going to prepare a quantum state of system Q drawn from the ensemble 
$\mathcal{E} = \{\rho_i, p(i)\}$. These states are messy (they may be nonorthogonal or mixed) 
but in general Alice will keep a classical record of her preparations, perhaps by writing in her 
notebook. We will call the notebook quantum system X, and assume that she writes one of a set of 
pure orthogonal states in her notebook. Thus for each value of $i$, there is a pure state 
$|i\rangle$ of the notebook X; to send message $i$, Alice prepares 
$|i\rangle\langle i|\otimes\rho_i$ with probability $p(i)$, and these states are orthogonal and 
perfectly distinguishable. 

Bob receives the system Q from Alice and performs a POVM $\{E_b\}$ on it. Bob finds POVMs 
distasteful, so he decides instead to fill in all the steps of the measurement. He appends a 
system W onto Q and performs an orthogonal measurement on QW, represented by the projectors 
$\{F_b\}$, which are mutually orthogonal. Lastly, in order to preserve a record of the 
measurement, Bob has his notebook Y which contains as many orthogonal states as measurement 
outcomes $b$. His measurement will project out an orthogonal state of QW which he will 
transcribe into an orthogonal state of his notebook. 

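The initial state of the entire setup is 

$$\rho_{XQWY} = \sum_i p(i)\,|i_X\rangle\langle i_X|\otimes\rho_i\otimes|0_W\rangle\langle 0_W|\otimes|0_Y\rangle\langle 0_Y|. \tag{2.22}$$

When Bob receives system Q from Alice, he acts on the combined system QWY with the unitary 
operation 

$$U_{QWY} : |\phi_Q\rangle\otimes|0_W\rangle\otimes|0_Y\rangle \mapsto \sum_b F_b\bigl(|\phi_Q\rangle\otimes|0_W\rangle\bigr)\otimes|b_Y\rangle \tag{2.23}$$

for any pure state $|\phi_Q\rangle$ in Q, where the $|b_Y\rangle$ are mutually orthogonal. The 
state of the combined system after Bob performs this transformation is 

$$\rho'_{XQWY} = \sum_{i,b,b'} p(i)\,|i_X\rangle\langle i_X|\otimes F_b\bigl[\rho_i\otimes|0_W\rangle\langle 0_W|\bigr]F_{b'}^\dagger\otimes|b_Y\rangle\langle b'_Y|. \tag{2.24}$$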

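We will be using strong subadditivity in the form 

$$S(\rho'_{XQWY}) + S(\rho'_Y) \le S(\rho'_{XY}) + S(\rho'_{QWY}). \tag{2.25}$$

We note first that due to unitary invariance $S(\rho'_{XQWY}) = S(\rho_{XQWY})$, and these are 
equal to $S(\rho_{XQ})$ since the systems W and Y are in pure states (with zero entropy). Thus 

\begin{align}
S(\rho'_{XQWY}) &= S\Bigl(\sum_i p(i)\,|i_X\rangle\langle i_X|\otimes\rho_i\Bigr) \notag\\
&= -\sum_i \mathrm{Tr}\bigl[p(i)\rho_i\log p(i)\rho_i\bigr] \tag{2.26}\\
&= -\sum_i p(i)\log p(i) - \sum_i p(i)\,\mathrm{Tr}\bigl(\rho_i\log\rho_i\bigr) \notag\\
&= H(X) + \sum_i p(i)S(\rho_i) \tag{2.27}
\end{align}

where Eqn 2.26 follows from the fact that $\rho_{XQ}$ is block diagonal in the index $i$. To 
calculate $\rho'_{XY}$, note that 

\begin{align}
\mathrm{Tr}\bigl(F_b\bigl[\rho_i\otimes|0_W\rangle\langle 0_W|\bigr]F_{b'}^\dagger\bigr) &= \mathrm{Tr}\bigl(F_{b'}^\dagger F_b\,\rho_i\otimes|0_W\rangle\langle 0_W|\bigr) \notag\\
&= \delta_{bb'}\,\mathrm{Tr}\bigl(F_b\,\rho_i\otimes|0_W\rangle\langle 0_W|\bigr) = \delta_{bb'}\,p(O_b|P_i) \tag{2.28}
\end{align}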

where the second equality follows from the orthogonality of the measurement. Thus we have that 

\begin{align}
\rho'_{XY} &= \sum_{i,b} p(i)\,p(O_b|P_i)\,|i_X\rangle\langle i_X|\otimes|b_Y\rangle\langle b_Y| \notag\\
\Rightarrow\quad S(\rho'_{XY}) &= -\sum_{i,b} p(i,b)\log p(i,b) = H(X, Y) \tag{2.29}
\end{align}

with $p(i,b) = p(i)\,p(O_b|P_i)$, and by taking another partial trace (over X) we find 

\begin{align}
\rho'_Y &= \mathrm{Tr}_X \sum_{i,b} p(i,b)\,|i_X\rangle\langle i_X|\otimes|b_Y\rangle\langle b_Y| = \sum_b p(b)\,|b_Y\rangle\langle b_Y| \notag\\
\Rightarrow\quad S(\rho'_Y) &= -\sum_b p(b)\log p(b) = H(Y). \tag{2.30}
\end{align}

Footnote: Transcribing is another way of saying cloning, and quantum mechanics forbids universal 
cloning [^]. However, cloning one of a known set of orthogonal states is allowed. 

The transformation $\rho_{QWY} \to \rho'_{QWY}$ is unitary, so 

\begin{align}
S(\rho'_{QWY}) &= S(\rho_{QWY}) = S(\rho_Q) \notag\\
&= S(\rho) \tag{2.31}
\end{align}

where the second equality follows from the purity of the initial states of W and Y, and we have 
used the previous notation $\rho = \sum_i p(i)\rho_i$. Combining Eqns 2.25 through 2.31, we find 

$$H(X) + \sum_i p(i)S(\rho_i) + H(Y) \le H(X, Y) + S(\rho); \tag{2.32}$$

recalling the definition of mutual information, we end up with the Holevo bound: 

$$I(X : Y) = H(X) + H(Y) - H(X, Y) \le S(\rho) - \sum_i p(i)S(\rho_i) = \chi(\mathcal{E}). \tag{2.33}$$


2.1.4 Quantum Channel Capacity 

The original, practical problem which motivated this chapter was: How does the quantum 
nature of the information carrier affect communication between Alice and Bob? We have found 
that, in contrast with classical states, information in quantum states is slippery and often in- 
accessible. So why would we want to employ non-orthogonal states, or even mixed states, in a 
communication system? 

The practical answer is that sometimes we cannot avoid it. If we send photons down an 
optical fibre, then in order to achieve a high transmission rate we will need to overlap the photon 
packets slightly, which means we are using nonorthogonal states. In fact, Fuchs has shown 
that for some noisy channels the rate of transmission of classical information is maximised by 
using nonorthogonal states! And if we hope to maximise transmission rate, we are going to have 
to have a deeper understanding of how errors are introduced into the photon packets, which 
requires us to deal with the mixed states emerging from the optical fibre — even if the input 
states were pure and orthogonal and so essentially "classical" . 

The question we now turn to is: What is the maximum information communicated with 
states drawn from the ensemble $\{\rho_i, p_i\}$? The answer to this question was given by 
Hausladen et al [^] for ensembles of pure states and independently by Holevo [32] and Schumacher 
and Westmoreland [33] for mixed states. The Holevo Bound can be attained asymptotically, if we 
allow collective measurements by the receiver. We will present the main concepts from the proof 
of the pure state result, without exercising complete rigour. 

The essential idea is that of a typical subspace. A sequence of $n$ signals from the source is 
represented by a vector $|\phi_{i_1}\rangle\cdots|\phi_{i_n}\rangle$ in the product Hilbert space 
$H^n$, and the ensemble of such signals is represented by the density operator 

\begin{align}
\rho^{(n)} &= \sum_{i_1,\ldots,i_n} p_{i_1}\cdots p_{i_n}\,|\phi_{i_1}\rangle\cdots|\phi_{i_n}\rangle\langle\phi_{i_1}|\cdots\langle\phi_{i_n}| \tag{2.34}\\
&= \rho\otimes\cdots\otimes\rho \tag{2.35}
\end{align}

where $\rho$ is the single system density matrix. Then for given $\epsilon, \delta > 0$, and for 
$n$ large enough, there exists a typical subspace $\Lambda$ of $H^n$ such that [^]: 

1. Both $\Lambda$ and $\Lambda^\perp$ are spanned by eigenstates of $\rho^{(n)}$. 

2. Almost all of the weight of the ensemble lies in $\Lambda$, in the sense that 

$$\mathrm{Tr}\,\Pi_\Lambda\rho^{(n)} > 1 - \epsilon \quad\text{and}\quad \mathrm{Tr}\,\Pi_{\Lambda^\perp}\rho^{(n)} < \epsilon \tag{2.36}$$

(where $\Pi_\Lambda$ is used to denote the projection onto the subspace $\Lambda$). 

3. The eigenvalues $\lambda_i$ of $\rho^{(n)}$ within $\Lambda$ satisfy 

$$2^{-n[S(\rho)+\delta]} < \lambda_i < 2^{-n[S(\rho)-\delta]}. \tag{2.37}$$

4. The number of dimensions of $\Lambda$ is bounded between 

$$(1 - \epsilon)\,2^{n[S(\rho)-\delta]} < \dim\Lambda < 2^{n[S(\rho)+\delta]}. \tag{2.38}$$

To see how this typical subspace is constructed, we suppose that the signals sent are indeed 
eigenstates of $\rho$. Then the signals are orthogonal and essentially classical, governed by the 
probability distribution $p_i$ given by the eigenvalues; for the properties listed above it makes 
no difference if the signal states are indeed eigenstates or some other (nonorthogonal) states. 
The eigenvalues of $\rho^{(n)}$ will then be $\mu_i = p_{i_1}\cdots p_{i_n}$, and by the weak law 
of large numbers (mentioned in Section 1.1.3) the set of these eigenvalues satisfies 

$$P\Bigl(\Bigl|-\frac{1}{n}\log\mu_i - S(\rho)\Bigr| > \delta\Bigr) < \epsilon. \tag{2.39}$$

Let $A$ be the set of eigenvalues satisfying $\bigl|-\frac{1}{n}\log\mu_i - S(\rho)\bigr| < \delta$, 
and define $\Lambda$ to be the subspace spanned by the eigenvectors corresponding to these 
eigenvalues; then a moment's thought reveals $\Lambda$ to have the properties listed above. 
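
The construction can be tried out numerically for a single-qubit $\rho$ with eigenvalues $(q, 1-q)$, where the eigenvalues of $\rho^{(n)}$ are $q^{n-k}(1-q)^k$ with degeneracy $\binom{n}{k}$. The sketch below collects the eigenvalues satisfying the typicality condition and reports the weight and dimension of the resulting subspace. The values $q = 0.9$ and $\delta = 0.1$ are hypothetical, logarithms are base 2, and the condition is taken with $\le$ for convenience.

```python
import math

def binary_entropy(q):
    """h(q) in bits."""
    return -sum(t * math.log2(t) for t in (q, 1 - q) if t > 0)

def typical_subspace(q, n, delta):
    """For rho with eigenvalues (q, 1 - q), 0 < q < 1, return (weight, dim) of the
    subspace of rho^(n) spanned by eigenvectors whose eigenvalue mu satisfies
    |-(1/n) log2(mu) - S(rho)| <= delta."""
    s = binary_entropy(q)
    weight, dim = 0.0, 0
    for k in range(n + 1):  # k = number of letters carrying eigenvalue 1 - q
        log2_mu = (n - k) * math.log2(q) + k * math.log2(1 - q)
        if abs(-log2_mu / n - s) <= delta:
            mult = math.comb(n, k)  # degeneracy of this eigenvalue
            dim += mult
            weight += 2.0 ** (math.log2(mult) + log2_mu)
    return weight, dim

q, delta = 0.9, 0.1
s = binary_entropy(q)
for n in (100, 1000):
    weight, dim = typical_subspace(q, n, delta)
    print(n, weight, math.log2(dim) / n, s)
```

As $n$ grows the weight approaches 1 while $\frac{1}{n}\log_2\dim\Lambda$ stays within $\delta$ of $S(\rho)$, far below the one qubit per letter of the full space.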

The technique of the proof is very similar to that for Shannon's Noisy Coding theorem: we use 
the idea of random coding to symmetrise the calculation of error probability. The technique of 
random coding is illustrated in Figure 2.2. We begin with an ensemble of letter states 
($\phi_1, \ldots$ in the Figure) and generate the ensemble of all possible $n$-letter words, the 
probability of a particular word just being the product of its letter probabilities. In the same 
way we construct codes each containing $k$ words, with $k$ left unspecified for the moment. 

The important feature of this hierarchy is that, just as there was a typical subspace of words, 
so too is there a typical subspace of codes which has the following properties: (a) the overall 
letter frequencies of each code are close to those of the letter ensemble, (b) the words in each 
typical code are almost all from the typical subspace of words and (c) the set of atypical codes 
has negligible probability for large enough $n$ and $k$. This means that in calculating error 
probability averaged over all codes, we can calculate the error only for these typical codes and 
include an arbitrarily small $\epsilon$-error for the contribution from non-typical codes. 

Figure 2.2: A schematic representation of random codes and their probabilities 

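To calculate this error, note that the expected overlap of any two code words 
$|u\rangle = |\phi_{x_1}\rangle\cdots|\phi_{x_n}\rangle$ and 
$|v\rangle = |\phi_{y_1}\rangle\cdots|\phi_{y_n}\rangle$ is 

\begin{align}
E|\langle u|v\rangle|^2 &= \sum_{x_1,\ldots,x_n}\sum_{y_1,\ldots,y_n} p_{x_1}p_{y_1}\cdots p_{x_n}p_{y_n}\,|\langle\phi_{x_1}|\phi_{y_1}\rangle|^2\cdots|\langle\phi_{x_n}|\phi_{y_n}\rangle|^2 \notag\\
&= \mathrm{Tr}\,\bigl(\rho^{(n)}\bigr)^2. \tag{2.40}
\end{align}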

By the above typicality argument, we need only consider the codes which consist of typical 
codewords when calculating averages over codes. So the expected overlap between code words 
in a typical code is 

\begin{align}
E_\Lambda|\langle u|v\rangle|^2 &= \mathrm{Tr}_\Lambda\bigl(\rho^{(n)}\bigr)^2 \tag{2.41}\\
&< 2^{n[S(\rho)+\delta]}\bigl(2^{-n[S(\rho)-\delta]}\bigr)^2 \tag{2.42}\\
&= 2^{-n[S(\rho)-3\delta]} \tag{2.43}
\end{align}

where we have used the bounds on the dimension of the typical subspace and the eigenvalues 
contained in it. So if we choose $k = 2^{n[S(\rho)-\delta']}$ then for any fixed codeword 
$|S_j\rangle$ the overlap of $|S_j\rangle$ with all other codewords, averaged over all codes, is 

$$E\sum_{i\ne j}|\langle S_i|S_j\rangle|^2 \le 2^{n[S(\rho)-\delta']}\,2^{-n[S(\rho)-3\delta]} + \epsilon = 2^{-n[\delta'-3\delta]} + \epsilon; \tag{2.44}$$

the $\epsilon$ is included for the contribution of atypical codes. For fixed $\delta'$ we can 
make $\epsilon$ and $\delta$ as small as we choose by making $n$ large; so the expected overlap 
can be made arbitrarily small. 

Footnote: Heuristically, if we choose two unit vectors at random from a vector space of dimension 
$D$, their squared overlap will be $1/D$ on average. 

We now invoke some arguments encountered previously in connection with Shannon's Noisy 
Coding theorem. Since this result (Eqn 2.44) holds on average, there is at least one code which 
has small average overlap between code words; and by throwing away at most half the code 
words (as discussed at Eqn 1.42) we can ensure that every codeword has less than $\epsilon$ 
overlap with all other codewords. And of course, since almost all the codes are typical, we can 
choose the code so that the letter frequencies of the code are close to those of the original 
ensemble. 

The resulting code will consist of about $2^{nS(\rho)}$ codewords which are almost equally 
likely; thus $S(\rho)$ bits are communicated per quantum system received. It is interesting to 
note how this communication is achieved: as the number of signals becomes large, it becomes 
possible to choose words which are almost orthogonal and hence highly distinguishable. However, 
decoding these message states will require sophisticated joint measurements by Bob, and the 
problem of finding the optimal measurement is very difficult. Hausladen et al [^] employ a 
specific POVM in their proof which, although not optimal, is easier to work with. They find a 
rigorous bound on the average error probability similar to that motivated here. 

We now have the result that, given an ensemble of message states 
$\mathcal{E} = \{\rho_i, p_i\}$, information can be communicated through a channel at the rate 

$$R = S\Bigl(\sum_i p_i\rho_i\Bigr) - \sum_i p_i S(\rho_i) = \chi(\mathcal{E}) \tag{2.45}$$

and no higher. We can define the quantum channel capacity (relative to a fixed alphabet) to be 

$$C_Q = \max_{\{p_i\}}\chi(\mathcal{E}) \tag{2.46}$$

similar to the definition of channel capacity in Eqn 1.34. This is then the maximum rate at 
which information can be communicated using the given physical resources. 
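
As an illustration of Eqn 2.46, the sketch below fixes an assumed alphabet of two pure qubit states with overlap 0.5 and searches a grid of prior probabilities for the one that maximises $\chi$. For pure states $\chi = S(\rho)$, which has a closed form for two states. The alphabet and grid resolution are hypothetical choices, and a grid search is only a crude stand-in for a proper optimisation.

```python
import math

def binary_entropy(x):
    """h(x) in bits."""
    return -sum(t * math.log2(t) for t in (x, 1 - x) if t > 0)

def chi_two_pure_states(p0, overlap):
    """Holevo information of {p0: |psi0>, 1 - p0: |psi1>}; for pure signal
    states chi reduces to S(rho)."""
    disc = math.sqrt(max(0.0, 1 - 4 * p0 * (1 - p0) * (1 - overlap ** 2)))
    return binary_entropy((1 + disc) / 2)

overlap = 0.5                             # hypothetical |<psi0|psi1>|
grid = [i / 1000 for i in range(1001)]    # candidate values of p0
capacity, best_p0 = max((chi_two_pure_states(p, overlap), p) for p in grid)
print(capacity, best_p0)
```

By symmetry the maximum sits at equal priors, giving roughly 0.81 bits per signal for this alphabet, compared with one bit for an orthogonal pair.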

2.2 Distinguishability 

In considering the information contained in a quantum state, we are dealing closely with issues 
of distinguishability. For example, suppose we use nonorthogonal states in a communication 
channel, and suppose there was a POVM which could distinguish with certainty between these 
states; then the information capacity of the channel could be calculated using the classical theory! 
In this connection several questions arise: Can we quantitatively measure the "distinguishability" 
of a set of states? Are there physical processes which increase distinguishability? Can we find 
the optimal measurement for distinguishing between states? And how can we use imperfect 
distinguishability to our advantage? 

Before looking at these issues, however, there is an important caveat to this relationship 
between accessible information and distinguishability. Suppose Alice and Bob communicate 
using two pure state signals, $|\psi_0\rangle$ and $|\psi_1\rangle$, with probabilities $p_0$ and 
$p_1$. It can be shown that the von Neumann entropy of this ensemble, and hence the ability of 
these states to convey information, is a monotonically decreasing function of the overlap 
$|\langle\psi_0|\psi_1\rangle|$. This makes sense: if we make the signal states less 
distinguishable they should convey less information. Intuitively, this should be a universal 
property: for any ensemble in any number of dimensions, if we make all the members of the 
ensemble more parallel we should decrease the von Neumann entropy. 



But Jozsa and Schlienz [34], in the course of investigating compression of quantum information 
(see the next chapter), have found that this is generically not true in more than two dimensions. 
Specifically, for almost any ensemble $\mathcal{E} = \{p_i, |\psi_i\rangle\}$ in three or more 
dimensions, there is an ensemble $\mathcal{E}' = \{p_i, |\psi'_i\rangle\}$ such that 

1. All the pairwise overlaps of states in $\mathcal{E}'$ are not smaller than those in 
$\mathcal{E}$, i.e. for all $i$ and $j$ 

$$|\langle\psi_i|\psi_j\rangle| \le |\langle\psi'_i|\psi'_j\rangle|. \tag{2.47}$$

2. $S(\mathcal{E}') > S(\mathcal{E})$. 

The relationship between distinguishability and information capacity is therefore not entirely 
straightforward; and in particular, the distinguishability of a set of states is a global property of 
the set, and can't be simply reduced to pairwise distinguishability of the states. 

Before even discussing measures of distinguishability, we discuss one possible way of increasing 
distinguishability. Suppose Alice has sent me a photon which I know is in a pure state of either 
horizontal or diagonal polarisation, each with probability 1/2. I decide to make 2N copies of the
state and pass N through a horizontally oriented calcite crystal* and N through a diagonally
oriented crystal. I can deduce the polarisation state of Alice's original photon by observing which
orientation of the crystal has all N photons emerging in one beam. But something is wrong:
Alice will thus have communicated one bit of information to me, despite the fact that the Holevo 
bound dictates that the most information she can transmit is S = 0.60. 
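The figure S = 0.60 can be checked numerically. The following is a minimal sketch, assuming entropy is measured in bits and using the closed-form eigenvalues of a two-state ensemble; the function names `binary_entropy` and `ensemble_entropy` are illustrative and not taken from the text.

```python
import math

def binary_entropy(x: float) -> float:
    """Shannon entropy, in bits, of the distribution (x, 1 - x)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def ensemble_entropy(p0: float, overlap: float) -> float:
    """Von Neumann entropy (bits) of rho = p0|psi0><psi0| + (1 - p0)|psi1><psi1|,
    where overlap = |<psi0|psi1>|."""
    p1 = 1.0 - p0
    # rho has trace 1 and determinant p0*p1*(1 - overlap^2) on the span of the
    # two states, so its eigenvalues are (1 +/- r)/2 with r as below.
    r = math.sqrt(max(0.0, 1.0 - 4.0 * p0 * p1 * (1.0 - overlap ** 2)))
    return binary_entropy((1.0 + r) / 2.0)

# Horizontal versus diagonal polarisation: overlap = cos(45 degrees).
S = ensemble_entropy(0.5, math.cos(math.pi / 4))
print(round(S, 2))  # prints 0.6
```

The same function also illustrates the monotonic behaviour described earlier for two states: the entropy is 1 bit for orthogonal signals and falls to 0 as the overlap approaches 1.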

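The problem lies in the assumption of copying. Wootters and Zurek [35] considered such a
"universal cloning machine", that is, a machine which acts on a system in state $|\psi\rangle$, an ancilla
in a standard state $|0\rangle$ and an environment in state $|E\rangle$ as

$$ |\psi\rangle \otimes |0\rangle \otimes |E\rangle \longrightarrow |\psi\rangle \otimes |\psi\rangle \otimes |E'\rangle \tag{2.48} $$

(where $|E'\rangle$ is any other state of the environment). They showed that if the machine is expected to
clone just two specific nonorthogonal states $|\psi_1\rangle$ and $|\psi_2\rangle$ faithfully then it violates the unitarity
of quantum mechanics, since

$$ \langle\psi_1|\psi_2\rangle = \big(\langle\psi_1| \otimes \langle 0| \otimes \langle E|\big)\big(|\psi_2\rangle \otimes |0\rangle \otimes |E\rangle\big) \tag{2.49} $$

$$ = \big(\langle\psi_1| \otimes \langle\psi_1| \otimes \langle E'|\big)\big(|\psi_2\rangle \otimes |\psi_2\rangle \otimes |E'\rangle\big) = \langle\psi_1|\psi_2\rangle^2 \tag{2.50} $$

which can only be satisfied if $\langle\psi_1|\psi_2\rangle = 0$ or 1 (the states are orthogonal or identical). The
first equality here follows from the normalisation of the extra states, and the second from the
unitarity of the process. And if we expect the machine to copy more than two nonorthogonal
states then it must violate the superposition principle, since if $|\psi_1\rangle$ and $|\psi_2\rangle$ can both be copied
then their superposition is "copied" as

$$ (|\psi_1\rangle + |\psi_2\rangle) \otimes |0\rangle \otimes |E\rangle \longrightarrow (|\psi_1\rangle \otimes |\psi_1\rangle + |\psi_2\rangle \otimes |\psi_2\rangle) \otimes |E'\rangle \tag{2.51} $$

$$ \neq (|\psi_1\rangle + |\psi_2\rangle) \otimes (|\psi_1\rangle + |\psi_2\rangle) \otimes |E'\rangle : \tag{2.52} $$

superpositions can't be copied by a unitary process.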
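The linearity argument can be illustrated with a small numerical sketch. As an assumption made purely for illustration, the "cloning machine" is taken to be a CNOT gate acting on the system and the ancilla, with the environment omitted; this gate copies the orthogonal states |0> and |1> perfectly, which makes its failure on their superposition easy to see. The helper names `kron` and `cnot` are hypothetical.

```python
import math

def kron(a, b):
    """Tensor product of two state vectors given as lists of amplitudes."""
    return [x * y for x in a for y in b]

def cnot(state):
    """Apply CNOT (control = first qubit) to a two-qubit state whose
    amplitudes are ordered as |00>, |01>, |10>, |11>."""
    a00, a01, a10, a11 = state
    return [a00, a01, a11, a10]

zero, one = [1.0, 0.0], [0.0, 1.0]
plus = [1 / math.sqrt(2), 1 / math.sqrt(2)]  # (|0> + |1>)/sqrt(2)

# The orthogonal states |0> and |1> are copied faithfully onto the ancilla |0>.
assert cnot(kron(zero, zero)) == kron(zero, zero)
assert cnot(kron(one, zero)) == kron(one, one)

# By linearity the superposition is sent to (|00> + |11>)/sqrt(2), the
# analogue of the right-hand side of Eqn 2.51 ...
copied = cnot(kron(plus, zero))
# ... whereas a true clone would be the product state of Eqn 2.52.
wanted = kron(plus, plus)

print(copied)  # amplitudes only on |00> and |11>
print(wanted)  # equal amplitudes on all four basis states
```

The two printed vectors differ, so a linear map that copies both basis states cannot also copy their superposition.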



It has also been shown that cloning would allow superluminal signalling [36]. A generalisation
of this No-Cloning theorem is the No-Broadcasting theorem [37], which states that there is no
physical process which achieves

$$ \rho_A \otimes \Sigma \longrightarrow \rho_{AB} \tag{2.53} $$

where $\Sigma$ is some standard state and the entangled state $\rho_{AB}$ satisfies $\mathrm{Tr}_A\,\rho_{AB} = \rho_A$ and
$\mathrm{Tr}_B\,\rho_{AB} = \rho_A$. Despite the impossibility of cloning, it has become a fruitful area of research.
Bounds on the probability of success of an attempt to clone have been studied [35], and cloning
has been used in many studies of quantum information and computation.†

* A calcite crystal has different refractive indices along two of its symmetry axes. So when photons are incident
on the crystal, they are refracted different amounts depending on their polarisation; hence the two beams
emerging from the crystal have orthogonal polarisations.

2.2.1 Wootters' Problem 



Wootters [39] considered the problem of distinguishing signals in quantum mechanics. The
scenario he considered (which is also graphically recounted in [^]) was the use of two-state 
probabilistic information carriers for transmitting information. Suppose we know that these 
information carriers — which we can call "infons" — are represented by unit vectors in a real 
two dimensional space: for example, they may be photons which are guaranteed to be in a state 
of plane polarisation. The state of one of these infons is thus completely characterised by 
one real parameter, 

$$ |\psi\rangle = \cos\theta\,|0\rangle + \sin\theta\,|1\rangle \tag{2.54} $$

where |0), |1) is some orthonormal basis for the two dimensional space. Exactly like in quantum 
mechanics, |0) and |1) correspond to some maximal test Q on the infon — in the case of a plane 
polarised photon these tests may be for horizontal or vertical polarisation, using a calcite crystal. 

Where Wootters departs from conventional quantum mechanics, however, is in the probabilities
of each outcome as a function of the state parameter $\theta$. If the infon was indeed a photon, then
the probability of obtaining the outcome corresponding to $|0\rangle$ (say, horizontal polarisation) would be

$$ p(\theta) = \cos^2\theta. \tag{2.55} $$

Wootters leaves undecided the exact probability function, but instead asks the question: If
Alice sends N infons to Bob, all with exactly the same value of $\theta$, how does the number of
"distinguishable" values of $\theta$ vary with the probability function $p(\theta)$? And what optimal function
$p^*(\theta)$ gives the greatest number of "distinguishable" states?

The concept of distinguishability used here was one of practical interest. Suppose Alice would
like to send one of $l$ signals to Bob by preparing the N infons with one of the parameter values
$\theta_1, \theta_2, \dots, \theta_l$. If Alice chooses parameter value $\theta_k$, then the probability that Bob measures n of
them to be in state $|0\rangle$ (and hence N − n in state $|1\rangle$) is given by the binomial distribution,

$$ p(n|\theta_k) = \frac{N!}{n!\,(N-n)!}\,[p(\theta_k)]^n\,[1-p(\theta_k)]^{N-n}, \tag{2.56} $$

where the function $p(\theta)$ is as yet unspecified. Of course, if the signals are to be distinguished
reliably, they must be a standard deviation $\sigma$ apart (or some multiple of $\sigma$, depending on how
reliable we require the communication to be). Communication using infons is thus subject to two
competing requirements: that there be as many signal states as possible (and thus that they be
very close together) and that they be at least a standard deviation apart. Clearly the exact form
of the probability function $p(\theta)$ will have a great impact on how many distinguishable states are
available; if $p(\theta)$ is almost uniform over $0 < \theta < 1$ then almost all values of $\theta$ will give the same
outcome statistics, and Bob will have no idea what value of $\theta$ Alice used!
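The competition between the two requirements can be made concrete with a short sketch of Eqn 2.56. The separation criterion used in `distinguishable` (the two expected frequencies must differ by at least the sum of their standard deviations) is an assumption chosen for illustration, not Wootters' exact definition, and all function names are hypothetical.

```python
import math

def outcome_probability(n: int, N: int, p: float) -> float:
    """Eqn 2.56: probability of n outcomes |0> among N infons, with p = p(theta_k)."""
    return math.comb(N, n) * p ** n * (1.0 - p) ** (N - n)

def photon_law(theta: float) -> float:
    """Eqn 2.55, the probability law for plane-polarised photons."""
    return math.cos(theta) ** 2

def spread(N: int, p: float) -> float:
    """Standard deviation of the observed frequency n/N under the binomial law."""
    return math.sqrt(p * (1.0 - p) / N)

def distinguishable(theta_a: float, theta_b: float, N: int, law=photon_law) -> bool:
    """Assumed criterion: the expected frequencies are at least the sum of
    the two standard deviations apart."""
    pa, pb = law(theta_a), law(theta_b)
    return abs(pa - pb) >= spread(N, pa) + spread(N, pb)

base = math.pi / 4
print(distinguishable(base, base + 0.2, 100))     # well separated signals
print(distinguishable(base, base + 0.02, 100))    # too close for N = 100
print(distinguishable(base, base + 0.02, 10000))  # resolved once N is large
```

Increasing N narrows the binomial spread as $1/\sqrt{N}$, so more closely spaced parameter values become usable as signals.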

† Copious references to this can be found on the LANL preprint server, http://xxx.lanl.gov/.

It turns out, unsurprisingly, that which function maximises the number of distinguishable
states* depends on N. But the limit of these optimal functions $p^*_N(\theta)$ for large N turns out to
be one of the functions

$$ \cos^2 \tfrac{m}{2}(\theta - \theta_0) \tag{2.57} $$

for some positive integer m. The actual probability law for photon polarisation measurements
is of this form, so the universe in fact does operate in a way that maximises the amount of
information we (asymptotically) obtain from photons!

Can more of quantum mechanics be "derived" from such a principle of extremisation of in- 
formation, similar to Fermat's principle of least time? And, as in the case of light "finding the 
quickest route" between two points as a consequence of its wave nature, is there an underlying 
reason why photons maximise the amount of information we can obtain from them? In a sense, 
information is an ideal candidate for capturing the essence of quantum mechanics because it is a 
natural characterisation of uncertainty — and quantum mechanics is an intrinsically uncertain 
theory. Unfortunately, even Wootters' interesting result is not entirely convincing: his extrem- 
isation principle is valid only in a real Hilbert space; the actual quantum mechanical law does 
not maximise information in a two dimensional complex Hilbert space. 

In passing, we note that this work provided motivation for the "natural metric" on projective
Hilbert space mentioned previously (Eqn 2.2). Suppose we know that a system is in one of two
states, but we don't know which, and let N be the minimum number of copies of the state we
require in order to be able to distinguish reliably between the two states (reliability is again
defined by a fixed number of standard deviations). Then Wootters [39], [^] defines statistical
distance to be $1/\sqrt{N}$, and goes on to show that statistical distance is proportional to the "actual
distance" given in Eqn 2.2. This endows the actual distance with a practical interpretation.



2.2.2 Measures of Distinguishability 

The problem of distinguishing between probability distributions has been well studied, and 
several measures of distinguishability are discussed by Fuchs [26]. Some of these will be briefly
described below. These notions of statistical distinguishability can be easily transplanted to
quantum mechanics to refer to quantum states: all we do is consider the distinguishability of the
probability distributions $p_0(b) = \mathrm{Tr}\,\rho_0 E_b$ and $p_1(b) = \mathrm{Tr}\,\rho_1 E_b$ for some POVM $\{E_b\}$, and then
extremise the resulting function over all POVMs. It is also useful from a practical point of view
to ask which POVM maximises a given distinguishability measure. 

The distinguishability problem for probability distributions is formulated as follows. Our a
priori knowledge is that a process is governed either by distribution $p_0(b)$ (with probability $\pi_0$)
or by distribution $p_1(b)$ (with probability $\pi_1$). From our position of ignorance, we ascribe the
probability distribution $p(b) = \pi_0 p_0(b) + \pi_1 p_1(b)$ to the process. A distinguishability measure
will be a functional of the functions $p_0$ and $p_1$, and possibly (but not necessarily) also of $\pi_0$ and
$\pi_1$; in order to be useful this measure should have some instrumental interpretation, as
mutual information could be interpreted as number of bits of information conveyed reliably.

* The function extremised by Wootters was in fact the mutual information between $\theta$ and the number n of $|0\rangle$
outcomes. This is closely related to "number of distinguishable parameter values".

† But recall, as mentioned earlier, that distinguishability of states in more than two dimensions is not a simple
function of pairwise distinguishability of the constituent states; little is known of the more general problem.

Fuchs considers five such measures of distinguishability of two states†:

1. Probability of error. The simplest way of distinguishing the two distributions is to
sample the process once, and guess which distribution occurred. Let the possible outcomes
of the sampling be $\{1, 2, \dots, n\}$ and suppose $\delta : \{1, \dots, n\} \to \{0, 1\}$, called a decision
function, represents such a guess (that is, $\delta(3) = 1$ means that whenever event 3 occurs we
guess that distribution $p_1$ was the one sampled). The probability of error is defined to be

$$ P_e = \min_{\delta} \big[ \pi_0 P(\delta = 1|0) + \pi_1 P(\delta = 0|1) \big] \tag{2.58} $$

where $P(\delta = i|j)$ is the probability that the guess is $p_i$ when the true distribution is $p_j$.
This measure has the advantage that it is very easy to compute.

2. Chernoff bound. Probability of error is limited by the fact that it is determined by exactly
one sampling, which is not always the best way of distinguishing. It would make more sense
to sample the process N times; in effect we will be sampling the N-fold distributions ($p_0^N$
or $p_1^N$) once, and we can apply almost the same reasoning as above to calculate the N-fold
error probability. It turns out that for $N \to \infty$,

$$ P_e \to \lambda^N \quad \text{where} \quad \lambda = \min_{0 \le \alpha \le 1} \sum_{b=1}^{n} p_0(b)^{\alpha}\, p_1(b)^{1-\alpha}. \tag{2.59} $$

We call $\lambda$ the Chernoff bound, since it is also an upper bound on the probability of error
for any N. While perhaps having more applicability, the Chernoff bound is very difficult
to calculate.

3. Statistical overlap. Since $\lambda$ is hard to calculate we could consider bounds of the form
$\lambda_\alpha$, similar to Eqn 2.59, which are not optimised over $\alpha$. Of these, the most useful is the
statistical overlap

$$ \mathcal{F}(p_0, p_1) = \sum_{b=1}^{n} \sqrt{p_0(b)}\,\sqrt{p_1(b)} \tag{2.60} $$

which is also, conveniently, symmetric. This function has previously been studied in connection
with the Riemannian metric on the probability simplex; so while it is not practically
useful it has compelling mathematical application.

4. Kullback-Leibler information. The Kullback-Leibler information was mentioned previously
in connection with mutual information (Eqn 1.28):

$$ D(p_0\|p_1) = \sum_{b=1}^{n} p_0(b) \log\frac{p_0(b)}{p_1(b)} \tag{2.61} $$

The interpretation of this measure is slightly more abstract, and may be called "Keeping
the Expert Honest" [^]. Suppose we have an expert weather forecaster who can perfectly
predict probability distributions for tomorrow's weather; the only problem is that he enjoys
watching people get wet and so is prone to distorting the truth. We wish to solve the
problem by paying him according to his forecast, and we choose a particular payment
function which is maximised, on average, if and only if he tells the truth, i.e. if the forecasted
probabilities usually coincide with those that occur. The Kullback-Leibler information
measures his loss, relative to this expected maximum payment, if he tries to pass off the
distribution $p_1$ for the weather when in fact $p_0$ occurs.



5. Mutual information. We have already encountered mutual information in a slightly
different context, in Chapter 1. Here we consider the mutual information between the
outcome and the mystery distribution index, 0 or 1:

$$ J = H(p) - \pi_0 H(p_0) - \pi_1 H(p_1) \tag{2.62} $$

$$ = \pi_0 D(p_0\|p) + \pi_1 D(p_1\|p). \tag{2.63} $$

From the last form of the mutual information, we see that J is the expected loss of the
expert when the weather is guaranteed to be either $p_0$ or $p_1$, and he tries to make us believe
it's always their average, $p$.
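The five classical measures above can be evaluated directly for small distributions. The sketch below is illustrative only: the function names are hypothetical, logarithms are taken to base 2, the Chernoff minimisation is a plain grid search over $\alpha$ rather than an exact optimisation, and the distributions are assumed to have strictly positive entries. The probability of error is computed with the optimal decision rule, which guesses, for each outcome b, whichever distribution has the larger weighted probability.

```python
import math

def prob_error(p0, p1, pi0=0.5):
    """Eqn 2.58 with the error minimised over decision functions: each
    outcome b contributes the smaller of pi0*p0(b) and pi1*p1(b)."""
    pi1 = 1.0 - pi0
    return sum(min(pi0 * a, pi1 * b) for a, b in zip(p0, p1))

def chernoff(p0, p1, steps=1000):
    """Eqn 2.59: lambda, minimised over alpha in [0, 1] by grid search.
    Assumes all probabilities are strictly positive."""
    def lam(alpha):
        return sum(a ** alpha * b ** (1.0 - alpha) for a, b in zip(p0, p1))
    return min(lam(k / steps) for k in range(steps + 1))

def overlap(p0, p1):
    """Eqn 2.60: statistical overlap (the alpha = 1/2 term of Eqn 2.59)."""
    return sum(math.sqrt(a * b) for a, b in zip(p0, p1))

def kl(p0, p1):
    """Eqn 2.61 in bits; outcomes with p0(b) = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p0, p1) if a > 0)

def mutual_information(p0, p1, pi0=0.5):
    """Eqn 2.63: weighted Kullback-Leibler informations relative to the average p."""
    pi1 = 1.0 - pi0
    p = [pi0 * a + pi1 * b for a, b in zip(p0, p1)]
    return pi0 * kl(p0, p) + pi1 * kl(p1, p)

# Two example distributions over two outcomes (arbitrary illustrative numbers).
p0, p1 = [0.9, 0.1], [0.2, 0.8]
for measure in (prob_error, chernoff, overlap, kl, mutual_information):
    print(measure.__name__, measure(p0, p1))
```

For these inputs the first three quantities shrink, and the last two grow, as the distributions are made more different, in line with the discussion of which measures are to be minimised and which maximised.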

In the quantum case, we wish to minimise the first three measures (since in these cases a
measure of 0 corresponds to maximum distinguishability) and maximise the final two over all
possible POVMs. There is a very thorough discussion of these extremisations in [26]. The only
measures which have a closed form in terms of quantum states are the probability of error and 
the statistical overlap. The Chernoff bound becomes ambiguous in the quantum case, since col- 
lective measurements become possible; not only this, but the optimal single-system measurement 
depends expressly on the number of samplings allowed, so this measure loses some of its meaning. 

For curiosity we note that the statistical overlap of two quantum states has a simple closed
form expression. The statistical overlap, minimised over all POVMs, of $\rho_0$ and $\rho_1$ is

$$ \mathcal{F}(\rho_0, \rho_1) = \mathrm{Tr}\sqrt{\rho_1^{1/2}\,\rho_0\,\rho_1^{1/2}} = \sqrt{F(\rho_0, \rho_1)}, \tag{2.64} $$

where the quantity defined on the right is called the fidelity between the two states. The quantum
statistical overlap has the following useful significance. Let $|\psi_0\rangle$ and $|\psi_1\rangle$ be any two purifications
(see Chapter 1) of the same dimensions of states $\rho_0, \rho_1$. Then $F(\rho_0, \rho_1)$ is an upper bound to
the squared overlaps $|\langle\psi_0|\psi_1\rangle|^2$ for all purifications, and moreover this bound is achievable [^], [26].

The final two measures are very difficult to work with, and the best that can be done in
the quantum situation is to find upper and lower bounds on the distinguishability, of which one
useful bound is the Holevo bound. Another measure worth mentioning, which is related to the
probability of error in the quantum case and has the useful property of being a metric, is
the distortion [42]. The distortion between $\rho_0$ and $\rho_1$ is defined to be

$$ d(\rho_0, \rho_1) = \|\rho_0 - \rho_1\| \tag{2.65} $$

where $\|\cdot\|$ is some appropriate operator norm. We may choose the trace norm, $\|A\| = \mathrm{Tr}\,|A|$,
where $|A| = \sqrt{A^\dagger A}$.
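For a single qubit the trace-norm distortion of Eqn 2.65 can be evaluated by hand, since the eigenvalues of a 2x2 Hermitian matrix have a closed form. The following is a minimal sketch under that restriction; the helper names and the choice of real, plane-polarisation states as test cases are assumptions made for illustration.

```python
import math

def trace_norm_2x2(m):
    """Tr|A| for a 2x2 Hermitian matrix A = [[a, b], [conj(b), d]]:
    the sum of the absolute values of its two eigenvalues."""
    a, b, d = m[0][0].real, m[0][1], m[1][1].real
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + abs(b) ** 2)
    return abs(mean + radius) + abs(mean - radius)

def pure_state(theta):
    """Density matrix of the state cos(theta)|0> + sin(theta)|1>."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * c, c * s], [c * s, s * s]]

def distortion(rho0, rho1):
    """Eqn 2.65 with the trace norm, for 2x2 density matrices."""
    diff = [[rho0[i][j] - rho1[i][j] for j in range(2)] for i in range(2)]
    return trace_norm_2x2(diff)

# Horizontal versus vertical, and horizontal versus diagonal, polarisation.
print(distortion(pure_state(0.0), pure_state(math.pi / 2)))
print(distortion(pure_state(0.0), pure_state(math.pi / 4)))
```

Orthogonal states give the maximal value 2, identical states give 0, and the horizontal and diagonal states considered earlier fall in between.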

One important requirement for any feature of distinguishability is its monotonicity under evo- 
lution of the quantum system. If under the action of the same superoperator two states become 
more distinguishable, then the distinguishability measure may not have a useful interpretation 
— since the second law of thermodynamics implies, roughly, that states lose information as time 
increases. In this regard, one may also investigate the change in distinguishability upon adding 
an ancilla, or on tracing out a system, or under unitary evolution. We may also ask that a dis- 
tinguishability measure be monotonic in another measure — or if this is not the case, we should 
ask where they disagree. 

2.2.3 Unambiguous State Discrimination 

All discrimination is not necessarily lost in quantum mechanics, however. Peres and Terno
[44], [^] have investigated how, given a set of linearly independent states, one might be able to
distinguish between them. They have found that there is a method to unambiguously distinguish
them, but that this method has a finite probability of failure unless the states are mutually
orthogonal. If the method always succeeded, we would have a technique for cloning nonorthogonal
states (find out which state it is, and make replicas of it).

Given the set $\{|\psi_1\rangle, \dots, |\psi_n\rangle\}$ of linearly independent states, consider the $(n-1)$-dimensional*
subspace $V_1 = \mathrm{span}\{|\psi_2\rangle, \dots, |\psi_n\rangle\}$. Let $E_1$ be a non-normalised projection onto the 1-dimensional
subspace $V_1^\perp$. Then $\mathrm{Tr}\,E_1|\psi_j\rangle\langle\psi_j| = 0$ for all $j \neq 1$: this operator unambiguously selects the
state $|\psi_1\rangle$. Continuing this way, we can generate a set of non-normalised projectors $E_1, \dots, E_n$
which select their corresponding states. Surely if we now define $E_0 = 1 - \sum_i E_i$, then these operators
will form a POVM, one which can discriminate unambiguously between all the given
states!

Unfortunately, these operators do not necessarily form a POVM. It is quite possible that the
operator $E_1 + \dots + E_n$ has an eigenvalue larger than 1, which means $E_0$ will not be a positive
operator. But even if it is a POVM, we have another catch: the operators $E_i$ are not normalised,
so $\mathrm{Tr}\,E_j|\psi_j\rangle\langle\psi_j| \neq 1$: we are not guaranteed that the POVM will recognise the state at all.
Hence the necessity of the catch-all operator $E_0$ which yields almost no information about the
identity of the state. In passing we note that the actual normalisations of the (unnormalised)
projectors $E_i$ will be given by the requirements that $E_0$ be a positive operator, and possibly that
the probability of the $E_0$ outcome be minimised; more detail about this is to be found in
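The construction can be tried out in the simplest case of two real qubit states. The sketch below is an illustration under stated assumptions: the states are taken to be $|\psi_1\rangle = |0\rangle$ and $|\psi_2\rangle = \cos\theta|0\rangle + \sin\theta|1\rangle$, both projectors share a common scale factor c, and the helper names are hypothetical. It shows that true projectors (c = 1) give a non-positive $E_0$, while scaling them down restores a valid POVM at the cost of a nonzero failure probability.

```python
import math

def projector(v, scale=1.0):
    """scale * |v><v| for a real two-component vector v."""
    return [[scale * v[i] * v[j] for j in range(2)] for i in range(2)]

def expectation(op, v):
    """<v|op|v> for a real 2x2 matrix op and a real vector v."""
    return sum(v[i] * op[i][j] * v[j] for i in range(2) for j in range(2))

def min_eigenvalue(m):
    """Smallest eigenvalue of a real symmetric 2x2 matrix."""
    mean = (m[0][0] + m[1][1]) / 2.0
    radius = math.sqrt(((m[0][0] - m[1][1]) / 2.0) ** 2 + m[0][1] ** 2)
    return mean - radius

def unambiguous_povm(theta, c):
    """Build E0, E1, E2 for the two states described above, with scale c."""
    psi1 = [1.0, 0.0]
    psi2 = [math.cos(theta), math.sin(theta)]
    e1 = projector([-math.sin(theta), math.cos(theta)], c)  # orthogonal to psi2
    e2 = projector([0.0, 1.0], c)                           # orthogonal to psi1
    e0 = [[(1.0 if i == j else 0.0) - e1[i][j] - e2[i][j] for j in range(2)]
          for i in range(2)]
    return psi1, psi2, e0, e1, e2

theta = math.pi / 3  # overlap <psi1|psi2> = 1/2

# With true projectors, E0 has a negative eigenvalue: not a POVM.
_, _, e0, _, _ = unambiguous_povm(theta, 1.0)
print(min_eigenvalue(e0))

# Scaling by c = 1/(1 + overlap) makes E0 positive (smallest eigenvalue zero).
c = 1.0 / (1.0 + abs(math.cos(theta)))
psi1, psi2, e0, e1, e2 = unambiguous_povm(theta, c)
print(min_eigenvalue(e0))
print(expectation(e1, psi1))  # probability that psi1 is positively identified
print(expectation(e1, psi2))  # E1 never fires on psi2
```

With this scaling each state is identified with probability $1 - |\langle\psi_1|\psi_2\rangle|$, and the remaining probability goes to the inconclusive outcome $E_0$.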

2.3 Quantum Key Distribution 

The discrepancy between preparation information and accessible information is the basis for 
quantum key distribution (QKD). The term quantum cryptography is also used in this connection, 
but in fact the quantum aspect is merely a useful protocol within the larger field of cryptography. 
Using the quantum mechanical properties of information carriers, Alice and Bob can generate a 
random sequence of classical bits (1's and 0's) which they both know perfectly and which
they can guarantee is not known by any third party. 

If Alice and Bob share a secret key in this way, they can transmit information completely 
securely over a public (insecure) channel. They do this by using the Vernam Cipher or One- 
Time Pad, which is the only guaranteed unbreakable code known. If Alice wishes to send the 
secret message 101110 to Bob and they share a secret key 001011, then by bitwise-adding the 
message and the key she arrives at the encrypted message 100101 which she sends to Bob. By 
bitwise-adding the secret key to the message, Bob uncovers the original message. It can be shown 
that, as long as Alice and Bob use the secret key only once, an eavesdropper Eve can obtain 
no information about the message — and even if the key is partially known to Eve, there are 
protocols which Alice and Bob can use to communicate completely securely |4^]. The problem, 
as experienced by nameless spies throughout the Cold War, is how to share a secret key with 
someone when it is difficult or impossible to find a trusted courier to carry it. 
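The bitwise addition used by the Vernam cipher is addition modulo 2 (XOR) on each pair of bits. A minimal sketch reproducing the example above follows; the function name `bitwise_add` is illustrative.

```python
def bitwise_add(a: str, b: str) -> str:
    """Bitwise addition modulo 2 (XOR) of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("message and key must have the same length")
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

message, key = "101110", "001011"

ciphertext = bitwise_add(message, key)
print(ciphertext)                    # prints 100101
print(bitwise_add(ciphertext, key))  # prints 101110
```

Adding the key a second time undoes the encryption because every bit added to itself modulo 2 gives 0.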

QKD solves this problem using properties of quantum mechanics. The first step in this protocol†
requires Alice to write down two random, independent strings of n bits, as shown in
Figure 2.3: a value string and a basis string. She then prepares n qubits according to the bits in
her two strings, as shown in the table below; in this table, $\{|0\rangle, |1\rangle\}$ is some orthonormal basis
and $\{|\psi_0\rangle, |\psi_1\rangle\}$ is any other basis.



* We assume the vector space is exactly n-dimensional.

† This protocol is known as BB84, after its inventors Bennett and Brassard and its year of publication.

Value (Alice)         011001111010010001
Basis A (Alice)       100000110100011000
Basis B (Bob)         000101101101110010
Agreed Basis (A + B)  100101011001101010
Measurement (Bob)     [illegible]
Secret Key            110101101
Figure 2.3: The private random bit strings required for quantum key distribution.

Alice sends this system to Bob, who must make a measurement on it to find out about Alice's
preparation. This step also requires Bob to have written down a random, independent n-bit
string, which is also a basis string. If Bob's i-th bit is 0, he measures the i-th qubit he receives in
the $\{|0\rangle, |1\rangle\}$ basis; if this bit is 1, he measures in the $\{|\psi_0\rangle, |\psi_1\rangle\}$ basis. If the result is $|0\rangle$ or
$|\psi_0\rangle$, he writes down a 0 as his measurement result, otherwise he writes down a 1.

Now Alice and Bob both publicly announce their basis strings*. By performing a bitwise-addition
of the two basis strings, Alice and Bob (and indeed anyone paying attention) can find
out when Alice's preparation and Bob's measurement were in the same basis; this will occur on
average half the time. When this happens Alice knows with 100% certainty what result Bob
got from his measurement, so Alice and Bob will share one bit of a secret key. In Figure 2.3,
whenever a 0 occurs in the Agreed Basis string the measurement outcomes agree, so Alice and
Bob end up with a 9 bit secret key (in general, an n/2-bit key).
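The sifting step can be reproduced from the strings of Figure 2.3. The sketch below assumes the Value, Basis A and Basis B rows read as the 18-bit strings given in the code, and it sifts Alice's value string only (Bob's measurement row is not used); the function names are illustrative.

```python
def bitwise_add(a: str, b: str) -> str:
    """Bitwise addition modulo 2 of two equal-length bit strings."""
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

def sift(values: str, agreed: str) -> str:
    """Keep the value bits at positions where the agreed-basis bit is 0,
    i.e. where preparation and measurement used the same basis."""
    return "".join(v for v, a in zip(values, agreed) if a == "0")

value   = "011001111010010001"  # Alice's value string
basis_a = "100000110100011000"  # Alice's basis string
basis_b = "000101101101110010"  # Bob's basis string

agreed = bitwise_add(basis_a, basis_b)
print(agreed)               # prints 100101011001101010
print(sift(value, agreed))  # prints 110101101
```

The nine zeros in the agreed-basis string select the nine key bits, matching the Secret Key row of the figure.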

In what sense is the key "secret"? Can Eve not surreptitiously observe the proceedings, 
intercept the states and make measurements on them? In fact Eve can do so, but the problem 
is that any measurement which obtains a non-negligible amount of information will disturb the 
message state in a detectable way. By applying certain random hashing functions on the bits 
1 47 1 and sharing the values of the hashing functions, Alice and Bob can discover if their mutual 
information is less than it should be — a sign of eavesdropping — in which case they discard 
the key. 
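The detectability of eavesdropping can be illustrated by a simple simulation. This sketch makes several assumptions that are not in the text: Eve uses an intercept-and-resend strategy, the two bases are mutually unbiased (so a measurement in the wrong basis gives a fair coin flip), and Alice and Bob detect her by directly comparing sifted bits rather than by the random hashing functions mentioned above. All names are hypothetical.

```python
import random

def measure(bit, prep_basis, meas_basis, rng):
    """Outcome of measuring a qubit prepared as `bit` in `prep_basis`:
    certain if the bases agree, a fair coin otherwise (assumes the two
    bases are mutually unbiased)."""
    return bit if prep_basis == meas_basis else rng.randint(0, 1)

def error_rate(n, eavesdrop, seed=0):
    """Fraction of sifted bits on which Alice and Bob disagree."""
    rng = random.Random(seed)
    errors = kept = 0
    for _ in range(n):
        a_bit, a_basis, b_basis = (rng.randint(0, 1) for _ in range(3))
        bit, basis = a_bit, a_basis
        if eavesdrop:
            e_basis = rng.randint(0, 1)
            bit = measure(bit, basis, e_basis, rng)  # Eve measures ...
            basis = e_basis                          # ... and resends her result
        b_bit = measure(bit, basis, b_basis, rng)
        if a_basis == b_basis:                       # sifting: keep matching bases
            kept += 1
            errors += (b_bit != a_bit)
    return errors / kept

print(error_rate(20000, eavesdrop=False))  # no eavesdropper: no errors
print(error_rate(20000, eavesdrop=True))   # intercept-resend: roughly a quarter
```

Under these assumptions Eve guesses the wrong basis half the time, and each wrong guess randomises Bob's result, so about one sifted bit in four disagrees.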

* It is crucial that this announcement only happen after Bob has made his measurement.

Considering that the original proposal for a QKD protocol was formulated by Bennett and
Brassard in 1984 [48], it took a long time for a general proof of the security of the protocol to
be given. In 1998, Lo and Chau showed that if all the parties, including Eve, are equipped
with quantum computers, then no measurement made by Eve which yields any information
whatever can remain undetected by Alice and Bob. Almost simultaneously Mayers [50] independently
showed that unconditional security was attainable without quantum computers. A
comprehensive discussion of the nuts and bolts of QKD is given in [51].

The implementation of this protocol using photon polarisation as the qubit requires efficient 
single-photon detectors, long coherence times in very small photon packets and highly accurate 
polarisation control. The experimental efforts in this field have been very successful, both through 
open air and using 23 km of commercial optical fibre. A survey of results can be found in [52].



2.3.1 The Inference-Disturbance Tradeoff

Information gathering causes disturbance to quantum systems. This is the principle which 
powers QKD, and the idea has been with us since the early days of quantum theory: 

. . . every act of observation is an interference, of undeterminable extent, with the 
instruments of observation as well as with the system observed, and interrupts the 
causal connection between the phenomena preceding and succeeding it. [|^, p. 132] 

This necessary disturbance is such a part of quantum physics that undergraduates are told about
it in a first course on the theory. Unfortunately, this statement is frequently followed by an explanation
that this is a consequence of the Heisenberg uncertainty principle, a misconception that
has been with us since 1927 when Heisenberg [54] first explained his principle with a semiclassical
model.*

The meaning of Heisenberg's principle is far from the idea of disturbance considered here, 
and has to do with the inability to ascribe classical states of motion to a quantum system. 
Until the observer interacts with the system the classical variables x and p are not defined for 
the system. Specifically, no matter how accurately or cunningly you prepare copies of a 
state, measurements of position on half of them and momentum on the other half will yield 
measurement statistics obeying $\Delta x \cdot \Delta p \ge \hbar/2$. Trying to consider the "disturbance" to a variable
that does not have an objective meaning before, during or after the measurement is clearly an 
exercise in futility. 

The perspective of statistical inference, however, allows us to investigate the disturbance 
objectively. When Alice prepares a system in a state $\rho$, that means that she has certain predictive
power about any measurement we might make on the system (if $\rho$ is a pure state there is a
measurement which will yield a completely certain result) . This is what allows her to synchronise 
her predictions with Bob's measurements; in fact, as long as there is nonzero mutual information 
between her preparation and Bob's measurement, they will be able to distill a shared key. We 
can measure the disturbance to the state by Alice's loss of predictive power, in a statistical sense, 
perhaps by measuring the distinguishability of the prepared state from the disturbed state. 
Similarly we can measure the inferential power that Eve gains from her interference with some 
statistical measure; perhaps we consider the mutual information between her measurement and 
the state preparation. In the eavesdropping context all Eve might want is some knowledge of
the correlation between Alice's preparation and Bob's measurements.



* In the same article quoted above, Pauli wrote that the observer can choose from "two mutually exclusive
experimental arrangements," indicating that he also believed the disturbance was related to conjugate variables
in the theory.

Fuchs and Jacobs [55] highlight two features of this model which differ radically from the
"disturbance" discussed by the founding fathers of quantum theory. Firstly, in order to make this
a well-posed problem of inference, all the observers must have well-defined prior information.
The disturbance is then grounded with respect to Alice's prior information, and the inference
is grounded with respect to Eve's. In this way we avoid reference to disturbance of "the state"
(which is after all defined in terms of our knowledge of the outcomes of the same experiments
which "disturb" it) and we consider instead the predictive power of various observers, since statistical
prediction of observers' results is what quantum theory is all about.

Secondly, one must consider at least two nonorthogonal states; the question "How much is
the coherent state $|\psi\rangle$ disturbed by measuring its position?" is ill-posed. This is because if we
already know the state, we can measure it and simply remanufacture the state afterwards.
Likewise, if we know that a system is in one of a given set of orthogonal states, we can perform
a nondemolition measurement [56] and cause no disturbance to the state at all. The disturbance
we are considering here is thus disturbance to the set of possible states.

The situation of QKD can be described as follows. Alice and Eve have differing knowledge of a 
system's preparation, and therefore assign different states to it. In this situation, Alice has more 
predictive power than Eve. Alice passes the system to Eve, who is not happy that Alice knows 
more than her. So Eve attempts to interact with the system in such a way that she can bring 
her predictions more into alignment with Alice's, without influencing Alice's predictive power 
(so that Eve will not be detected). But, unfortunately for Eve, quantum mechanics is such that 
any such attempt to surreptitiously align her predictions with Alice's is doomed to failure. This 
seems to be a very strong statement about quantum theory. 



Few studies have been carried out to quantify this trade-off [57], [58]. A promising
direction for this work is to consider measures of entanglement (to be discussed in the next
chapter), and consider how entanglement may be used in improving inferential power of observers
in quantum mechanics [59], [60].



2.4 Maxwell's Demon and Landauer's Principle 

In a sense, the considerations of information in physics date back to the 19th century,
before quantum mechanics or information theory. Maxwell found an apparent paradox between
the laws of physics and the ability to gather information, a paradox which was resolved more 
than a century later with the discovery of a connection between physics and the gathering of 
information. 

Maxwell considered a sentient being (or an appropriately programmed machine), later named
a "Demon", whose aim was to violate a law of physics. Szilard [31] refined the conceptual
model proposed by Maxwell into what is now known as Szilard's engine. This is a box with
movable pistons at either end, and a removable partition in the middle. The walls of the box are
maintained at constant temperature T, and the single (classical) particle inside the box remains
at this temperature through collisions with the walls. A cycle of the engine begins with the
Demon inserting the partition into the box and observing which side the particle is on. He then
moves the piston in the empty side of the box up to the partition, removes the partition, and
allows the particle to push the piston back to its starting position isothermally. The engine
supplies $k_B T \ln 2$ energy per cycle, apparently violating the Second Law of Thermodynamics
(Kelvin's form ||2^). Szilard deduced that, if this law is not to be violated, the entropy of the 
Demon must increase, and conjectured that this would be a result of the (assumed irreversible) 
measurement process. 
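The figure of $k_B T \ln 2$ per cycle follows from a short worked calculation. As an assumption made for this sketch, the single particle is treated as an ideal gas obeying $pV' = k_B T$, with $V$ the full volume of the box, so that the isothermal expansion from $V/2$ back to $V$ delivers the work

```latex
W = \int_{V/2}^{V} p \, dV'
  = \int_{V/2}^{V} \frac{k_B T}{V'} \, dV'
  = k_B T \ln\frac{V}{V/2}
  = k_B T \ln 2 .
```

Since the expansion is isothermal, the same amount of energy is drawn as heat from the walls, which is why the cycle appears to convert heat from a single reservoir entirely into work.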

Figure 2.4: One cycle of a Szilard engine operation (after [^]). On the right is a
phase diagram with the Demon's co-ordinates on the vertical axis and
the box's on the left.

To rescue the Second Law (as opposed to assuming its validity and supposing that measurement
introduces entropy) we clearly need to analyse the Demon's actions to discover where
the corresponding increase in entropy occurs. Many efforts were made in this direction, mostly
involving analyses of the measurement process [62]. An important step in resolving this paradox
was Landauer's 1961 analysis of thermodynamic irreversibility of computing, which led to
Landauer's Principle: erasure of a bit of information in an environment at temperature T leads
to a dissipation of energy no smaller than $k_B T \ln 2$.
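To give a sense of scale, the Landauer limit can be evaluated numerically. The sketch below assumes a room temperature of 300 K, a value chosen here for illustration and not taken from the text; the function name `landauer_bound` is hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)

def landauer_bound(temperature_kelvin: float, bits: float = 1.0) -> float:
    """Minimum energy in joules dissipated by erasing `bits` bits of
    information at the given temperature, according to Landauer's Principle."""
    return bits * BOLTZMANN * temperature_kelvin * math.log(2)

# Erasing one bit at an assumed room temperature of 300 K.
print(landauer_bound(300.0))  # about 2.87e-21 J
```

The bound scales linearly with both temperature and the number of bits erased.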

Bennett []6^ exorcised the Demon in 1982. He noted that measurement does not necessarily 
involve an overall increase in entropy, since the measurement performed by the Demon can in 
principle be performed reversibly. The entropy of the Demon, considered alone, does increase, 
but the overall entropy is reduced through the correlation between the position of the particle 
and the Demon's knowledge. The important realisation is that the thermodynamic accounting 
is corrected by returning the Demon to his "standard memory state" at the end of the cycle. At
this stage the Demon erases one bit of knowledge and hence loses energy at least $k_B T \ln 2$; so
the Szilard engine cannot produce useful work.

Figure is reproduced from Bennett's paper, and follows the phase space changes through 
the cycle. In (a), the Demon is in a standard state and the particle is anywhere in the box; the 
entropy of the system is proportional to the phase space volume occupied. In (b) the partition 
is inserted and in (c) the Demon makes his measurement: his state becomes correlated with 
the state of the particle. Note that the overall entropy has not changed. Isothermal expansion 
takes place in (e), and the entropy of the particle+Demon increases. After expansion the Demon 
remembers some information, but this is not correlated to the particle's position. In returning 
the Demon to his standard state in (f) we dissipate energy into the environment, increasing its
entropy.

Figure 2.5: Two ways of erasing information: (1) direct erasure of the message; (2) two-step erasure, with steps (a) and (b).



2.4.1 Erasing Information in Quantum Systems 



Szilard's analysis has been criticised by Jauch and Baron [65] for introducing an operation
which cannot be treated by equilibrium statistical mechanics: at the moment when the piston is
inserted the gas violates the law of Gay-Lussac [65]. These authors even go so far as to denounce
any relation between thermodynamic entropy and informational entropy, much to the peril of
the author of the current chapter.

However, a more careful quantum analysis of Szilard's engine supports Szilard's idea of a 
connection between these concepts. Zurek has performed this analysis. He considers a
particle in an infinite square well potential of width $L$, and the piston is replaced by a slowly
inserted barrier of width $\delta \ll L$ and height $U \gg kT$. The main result is that, in an appropriate
limit, the system can at all times be described by its partition function Z: the thermodynamic 
approximation is valid. Zurek then analyses a "classical" demon to illustrate that Szilard's 
conclusion is correct if the Second Law is valid, and a "quantum" demon which explicitly has to 
reset itself and so demonstrates that Bennett's conclusion regarding the destination of the excess 
entropy is also correct. 

The moral of the story is that erasure of information is a process which costs free energy or,
equivalently, which causes an increase in overall entropy. This is useful for us, because if we can
find a way to efficiently erase information (i.e. to saturate the bound in Landauer's principle)
then we can give a physical interpretation of the Holevo bound.

Footnote: This is a slightly handwaving interpretation, and certainly not rigorous.

Plenio, exploiting an erasure protocol described by Vedral [?], gives such an optimal
erasure protocol based on placing the demon (or the measurement apparatus) in contact with
a heat bath with an appropriate density operator. Plenio then describes two ways to erase
information (illustrated in Figure 2.5). Alice and Bob are going to communicate by sending
mixed states from the ensemble $\{p_i; \rho_i\}$ (so the overall state is $\rho = \sum_i p_i \rho_i$). However, Alice is
feeling whimsical: she decides to send pure states $|\phi_k^i\rangle$ drawn from a probability distribution $r_k^i$.
These pure states are chosen so that $\rho_i = \sum_k r_k^i |\phi_k^i\rangle\langle\phi_k^i|$. Once Bob receives the system, he can
erase the information in one of two ways: (1) by directly erasing all the pure states, or (2)
by first erasing the irrelevant information regarding which pure state he received (a) and then
erasing the resulting known mixed state (b).

Footnote: He must erase the information and return the state to Alice if this is to be a cyclic process.

By erasing in one step, Bob incurs an entropy cost $S_1 = S(\rho)$, since he has erased efficiently.
In part (b) of the second method, he incurs a cost $S(\rho_i)$ when erasing a particular message
$\rho_i$, after determining that this was indeed the message Alice sent. However, on average this step
will cost Bob $S_{2b} = \sum_i p_i S(\rho_i)$. The best Bob can therefore do on step 2(a) is the difference:

$$ S_{2a} = S_1 - S_{2b}. \tag{2.66} $$

But Landauer's principle tells us that the best measurement Bob makes to determine the value
of $i$ that Alice sent can yield no more information than the entropy of erasure:

$$ I \le S_{2a} = S(\rho) - \sum_i p_i S(\rho_i). \tag{2.67} $$



We thus see that the Holevo bound, proved in Section 2.1.3, which was based on an abstract 
relation satisfied by the von Neumann entropy, has an interpretation in terms of a very deep 
principle in information physics, namely Landauer's principle. This connection provides further 
support for the latter principle and possibly also some guidance for the question, still unresolved, 
of the capacity of a channel to transmit intact quantum states. 
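
As an illustration of the bound in Eqn 2.67, the sketch below evaluates its right-hand side, $S(\rho) - \sum_i p_i S(\rho_i)$, for ensembles of single-qubit states. It assumes each state is specified by its Bloch vector $r$, so that the eigenvalues of the density operator are $(1 \pm |r|)/2$; the helper names and the example ensembles are hypothetical and chosen only for illustration. Entropies are in bits.

```python
import math

def binary_entropy(x):
    """Shannon entropy (in bits) of the distribution (x, 1 - x)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def qubit_entropy(bloch):
    """Von Neumann entropy of a qubit state with the given Bloch vector."""
    r = min(1.0, math.sqrt(sum(c * c for c in bloch)))
    return binary_entropy((1.0 + r) / 2.0)

def holevo_chi(probs, blochs):
    """S(rho) - sum_i p_i S(rho_i) for a qubit ensemble {p_i; rho_i}.

    The Bloch vector of the average state rho is the average Bloch vector.
    """
    mean = [sum(p * b[k] for p, b in zip(probs, blochs)) for k in range(3)]
    average_cost = sum(p * qubit_entropy(b) for p, b in zip(probs, blochs))
    return qubit_entropy(mean) - average_cost

# Hypothetical ensembles, each with two equally likely signals.
chi_orthogonal = holevo_chi([0.5, 0.5], [(0, 0, 1), (0, 0, -1)])   # spin up / spin down
chi_pure = holevo_chi([0.5, 0.5], [(0, 0, 1), (1, 0, 0)])          # nonorthogonal pure states
chi_mixed = holevo_chi([0.5, 0.5], [(0, 0, 0.5), (0.5, 0, 0)])     # the same directions, but mixed
print(chi_orthogonal, chi_pure, chi_mixed)
```

Orthogonal pure signals saturate one bit per qubit, while nonorthogonal or mixed signals leave Bob less to erase in step 2(a) and hence less accessible information.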



CHAPTER 3 
Entanglement and Quantum Information 

In the previous chapter we considered how classical information is manifested in quantum sys- 
tems. One of the major results was the discrepancy between the amount of information used to 
prepare a quantum system, and that which can usefully be obtained from it. Clearly the concept 
of "classical" information is missing something, just as "classical" orbits became awkward in 
quantum mechanics. 

A "classical" information carrier is usually a two-state system, or a system with two distinct 
phase space regions. This is regarded as a resource for conveying the abstract notion of a bit. 
By direct analogy, a two-level quantum system can be regarded as a resource for carrying one 
two-level quantum state's worth of information. We call the abstract quantity embodied by this 
quantum resource a quantum bit, or qubit. 

In this chapter we will consider the properties of quantum information. The Schumacher 
Coding Theorem will give us a "system-independent" way of measuring information. We 
will discuss the relation between quantum information and quantum entanglement, and how the 
latter can be measured — and more importantly, what it can be used for. We will also look at 
whether quantum information is useful in a practical sense: can it be protected from errors? 

3.1 Schumacher's Noiseless Coding Theorem 

A first question which arises is: How does Alice send quantum information to Bob? Classically, 
she would compress the information according to Shannon's Noiseless Coding Theorem, encode 
this into the state of her classical physical systems, and convey these to Bob — we assume 
there is no noise in the channel they are using — who would perform an inverse operation to 
extract Alice's message. In the quantum case, we immediately run into a problem. As soon as 
Alice attempts to determine (through measurement) her quantum state, she has lost exactly that 
quantum information she wished to convey to Bob — we discovered this in the previous chapter. 
If Alice happens to have the preparation information in classical form, she can send this to Bob; 
but this information can be arbitrarily large. Also, there are situations in which Alice doesn't 
know what state she's got (perhaps the state is the product of a quantum computation which 
now requires Bob's input [69]).

An obvious solution is to simply convey the physical system to Bob. If for example the system 
is a two-level quantum system, she will incur no more effort in doing this than in conveying one 
classical bit of information, which also requires a two-state system. But once again we encounter 
a redundancy: suppose the system has four orthogonal states but we happen to know that all 
the "signal" quantum states Alice needs to send are contained in a two-dimensional subspace. 
Then sending the entire system is not necessary, since Alice could transpose the state onto a 
two-dimensional system without losing any of this "quantum information". So the question is: 
Can we generalise this result in some way? How big does the system conveyed to Bob have to 
be? Does it depend on whether the quantum states Alice wants to send are pure or mixed? 

The answer for pure states is that the system can be compressed to a Hilbert space of
dimension $2^{S(\rho)}$, where once again $S(\rho)$ is the entropy of the signals. And for mixed states? One
would be tempted to guess that the answer is similar to that for pure states, with the Holevo 
information substituted for the entropy, but this is not known. This is certainly a lower bound, 
but the question becomes rather more complicated for mixed states, as will be discussed later. 
But for now, before discussing Schumacher's result, we will look at how quantum information is 
encoded, and what measure we will use for judging the accuracy of the decoded state. 
Figure 3.1: The transmission of quantum information. In (a), M → T → M', no information is
discarded, but in (b), M → C+E → C → C+E' → M', the communication is approximate.



Suppose the state Alice wants to send is in the Hilbert space $H_M$, which we call the message
system; this could be a subspace of some larger system space as long as we know that the signals
are in $H_M$. She wishes to transfer this quantum information to a transmission system represented
by $H_T$. Copying the information is not allowed, but transposition is; that is, the operation

$$ |a_M, 0_T\rangle \to |0_M, a_T\rangle \tag{3.1} $$

can be performed, where $|0_M\rangle$ and $|0_T\rangle$ are fixed standard states. This operation can be made
unitary if $\dim H_T \ge \dim H_M$, simply by specifying the (ordered) basis in $H_T$ to which some
(ordered) basis in $H_M$ gets mapped. This is the exact analogue of classical coding. Alice
then passes the system T to Bob, who performs an inverse swap operation onto his decoding
system $H_{M'}$, and no quantum information has been discarded. The communication process is
represented in Figure 3.1(a).

If we wish to compress the information so that it occupies a smaller Hilbert space, we will
have to consider an approximate transposition, as represented in Figure 3.1(b). In this case our
transmission system is decomposed into a code system $H_C$ of dimension $d$, and an ancilla $H_E$.
Only the code system is conveyed to Bob, and the ancilla is discarded. On the receiving end, Bob
attaches an ancilla $H_{E'}$ in a standard state and performs a decoding operation, which could be
the inverse $U^{-1}$ of the coding operation. In this case the exact form of the encoding operation
$U$ becomes very important; as much of the "weight" of Alice's signal states as possible must end
up in the quotient space $H_C$ that Bob receives, and indeed the dimension of this space must be
large enough to receive almost all of the "quantum messages" Alice would like to send.

Footnote: We can without loss of generality assume that the coding is unitary, but the decoding performed by Bob could
in general be a superoperator. We will not consider this complication here.

Suppose Alice sends signal states $|a_M\rangle$, each with probability $p(a)$. Then we can trace the
path of the signal as follows:

$$ \begin{aligned} |a_M\rangle\langle a_M| &\to U |a_M\rangle\langle a_M| U^{-1} = \Pi_a \\ &\to \mathrm{Tr}_E\, \Pi_a \otimes |0_{E'}\rangle\langle 0_{E'}| = W_a \\ &\to U^{-1} W_a U = \rho_a. \end{aligned} \tag{3.2} $$

So the final state that Bob receives will be a mixed state.

Clearly Bob will not get all the quantum information Alice started with. But how do we
measure what got to Bob? Any measure of distinguishability will work (where now Alice and
Bob's aim is clearly to minimise distinguishability). For mathematical reasons, the fidelity is
usually used, and in this case it has a very convenient interpretation. Suppose that Bob knew
precisely which pure state $|a_M\rangle$ Alice started with; for example, they are calibrating their
equipment. Then there is a maximal test [?] for which the outcome can be predicted with
certainty. Suppose the (mixed) state received by Bob after decoding is $\rho$; if Bob performs this
maximal test on the decoded state then the probability that he will think that the state is
exactly the one sent is $\mathrm{Tr}\, \rho\, |a_M\rangle\langle a_M|$. This is therefore a practical measure of the accuracy of
the received state. The average fidelity of the transmission is then defined as

$$ F = \sum_a p(a)\, \mathrm{Tr}\, |a_M\rangle\langle a_M| \rho_a. \tag{3.3} $$

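The fidelity is between 0 and 1, and equal to one if and only if all signals are transmitted perfectly,
as in Figure 3.1(a). Note that the fidelity can also be computed in terms of the code system
states (rather than the message and decoding systems):

$$ \begin{aligned} F &= \sum_a p(a)\, \mathrm{Tr}\, |a_M\rangle\langle a_M| \rho_a \\ &= \sum_a p(a)\, \mathrm{Tr}\, U^{-1} \Pi_a U\, U^{-1} W_a U \\ &= \sum_a p(a)\, \mathrm{Tr}\, \Pi_a W_a. \end{aligned} \tag{3.4} $$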

Schumacher then goes on to prove two lemmas: 

Lemma 1 Suppose that $\dim H_C = d + 1$ and denote the ensemble of states on M by
$\rho_M = \sum_a p(a) |a_M\rangle\langle a_M|$. Suppose there exists a projection $\Lambda$ onto a $d$-dimensional subspace of $H_M$ with
$\mathrm{Tr}\, \rho_M \Lambda > 1 - \eta$. Then there exists a transposition scheme with fidelity $F > 1 - 2\eta$.

Lemma 2 Suppose $\dim H_C = d$ and suppose that for any projection $\Lambda$ onto a $d$-dimensional
subspace of $H_M$, $\mathrm{Tr}\, \rho_M \Lambda < \eta$ for some fixed $\eta$. Then the fidelity $F < \eta$.

Both of these results are highly plausible. The first says that if almost all of the weight of 
the ensemble of message states is contained in some subspace no larger than the code system, 
then accurate transposition will be possible. This is proved constructively by demonstrating a 
transposition scheme with this fidelity. The second lemma deals with the situation in which any
subspace of the same size as the code system doesn't contain enough of the message ensemble, in
which case the fidelity is bounded from above.

No mention is made in these lemmas of the ensemble of states comprising $\rho_M$; they could
be arbitrary states, or they could be the eigenstates of $\rho_M$, but according to the lemmas the
transposition fidelity can always be bounded as appropriate. Focus is thus shifted away from the
ensemble, and all we have to deal with is the density operator $\rho_M$. Schumacher's theorem then
says:

Theorem 3.1 (Quantum Noiseless Coding Theorem [18])

Let M be a quantum signal source with a signal ensemble described by the density operator $\rho_M$
and let $\delta, \epsilon > 0$.

(i) Suppose that $S(\rho_M) + \delta$ qubits are available per signal. Then for sufficiently large $n$, groups
of $n$ signals can be transposed via the available qubits with fidelity $F > 1 - \epsilon$.

(ii) Suppose that $S(\rho_M) - \delta$ qubits are available for coding each signal. Then for sufficiently
large $n$, if groups of $n$ signals are transposed via the available qubits, then the fidelity $F < \epsilon$.

To prove this, we begin by noting that the maximum value of $\mathrm{Tr}\, \rho_M \Lambda$ is realised when $\Lambda$ is
a projection onto the subspace spanned by eigenstates of $\rho_M$ with the largest eigenvalues. Now,
as in the case of the classical theorem, we consider block coding, i.e. we focus on the density
operator $\rho_M^{\otimes n} = \rho_M \otimes \cdots \otimes \rho_M$. The eigenvalues of this operator are products of the single system
eigenvalues. As in the discussion of the law of large numbers following Eqn 1.18, we can find a
set of approximately $2^{n[S(\rho_M)+\delta]}$ such product eigenvalues with total sum $> 1 - \epsilon/2$. Then by
Lemma 1, there is a transposition scheme with fidelity $F > 1 - \epsilon$.

Conversely, if we can only add together $2^{n[S(\rho_M)-\delta]}$ of the largest eigenvalues, this sum can be
made arbitrarily small (see for example [?], [?]). Hence any projection onto a subspace of this
size will yield arbitrarily low fidelity.
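
The eigenvalue counting used in this proof can be checked numerically for a single-qubit source. The sketch below is only an illustration under stated assumptions: the source density operator has eigenvalues $\lambda$ and $1 - \lambda$ with $\lambda \ge 1/2$, so the product eigenvalues of a block of $n$ signals are $\lambda^k (1-\lambda)^{n-k}$ with multiplicity $\binom{n}{k}$. The function `qubits_per_signal` (a hypothetical name) adds up the largest product eigenvalues until their total weight reaches $1 - \epsilon$ and reports the number of qubits per signal needed to hold that subspace. The sample value of $\lambda$ is an assumption chosen for illustration (it is the larger eigenvalue of an equal mixture of two pure states with overlap $1/\sqrt{2}$).

```python
import math

def binary_entropy(x):
    """Shannon entropy (in bits) of the distribution (x, 1 - x)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def qubits_per_signal(lam, n, eps):
    """log2(number of largest eigenvalues of the n-fold product state
    needed to reach total weight 1 - eps), divided by n."""
    if not 0.5 <= lam < 1.0:
        raise ValueError("lam must be the larger eigenvalue, in [0.5, 1)")
    count = 0      # number of eigenvalues kept so far (exact integer)
    weight = 0.0   # their total weight
    # k counts the factors equal to lam; eigenvalues shrink as k decreases.
    for k in range(n, -1, -1):
        log_eig = k * math.log(lam) + (n - k) * math.log(1.0 - lam)
        mult = math.comb(n, k)
        class_weight = math.exp(math.log(mult) + log_eig)
        if weight + class_weight >= 1.0 - eps:
            # Only part of this degenerate class is needed.
            needed = math.ceil((1.0 - eps - weight) / math.exp(log_eig))
            count += min(mult, max(needed, 1))
            break
        weight += class_weight
        count += mult
    return math.log2(count) / n

LAM = (1.0 + 1.0 / math.sqrt(2.0)) / 2.0   # assumed sample eigenvalue
ENTROPY = binary_entropy(LAM)
rates = {n: qubits_per_signal(LAM, n, 0.01) for n in (10, 100, 1000)}
print("entropy per signal:", ENTROPY)
for n, rate in rates.items():
    print(n, rate)
```

As the block length grows, the number of qubits needed per signal falls towards the entropy of the source, in line with part (i) of the theorem; the approach is slow, with an excess that shrinks roughly like $1/\sqrt{n}$.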

Schumacher's result is valid for unitary encoding and decoding operations. In full generality, 
one should allow Alice and Bob to perform any operation (represented by a superoperator) on 
their systems and check whether further compression can be achieved than discussed above. This 
was considered by Barnum et al [?], who showed that Schumacher coding gives the maximum
possible compression. In particular, they showed that even if Alice has knowledge of which state 
she is sending, and applies different encodings depending on this knowledge, she can do no better 
than if she coded without knowledge of the signals. 

3.1.1 Compression of Mixed States 

What is the optimal compression for ensembles of mixed states? That is, what size Hilbert
space do we need to communicate signals from the ensemble $\{p_a; \rho_a\}$ accurately? Once again,
we will measure accuracy using the fidelity, although in this case we must use the form for mixed
states given in Eqn 2.64.



The protocol for pure state compression suggested by Jozsa and Schumacher in [71] can also
be applied to mixed state ensembles, and compresses an ensemble down to the value of its
von Neumann entropy. However, it is very simple to show that this is not optimal, simply by
considering an ensemble of mixed states with disjoint support. In this case our coding technique
could be simply to measure the system (we can distinguish perfectly between signal states
since they are disjoint) and prepare it in a pure state, such that the ensemble of pure states
has lower entropy.

For the problem of mixed state compression, the signal ensemble is $\mathcal{E} = \{p_a; \rho_a\}$, where
the $\rho_a$ are now mixed states. In order to code this, Alice accumulates a block of $n$ signals
$\rho_{a_1} \otimes \cdots \otimes \rho_{a_n} = \sigma_{\mathbf{a}}$, where $\mathbf{a}$ is a multiple index. The block ensemble is $\mathcal{E}_n = \{q_{\mathbf{a}}; \sigma_{\mathbf{a}}\}$, where
$q_{\mathbf{a}}$ is a product of the appropriate $p$ probabilities, and it has a density matrix $\rho_n = \sum_{\mathbf{a}} q_{\mathbf{a}} \sigma_{\mathbf{a}}$. Alice
will perform some operation on these states to code them.

We now have to distinguish two types of coding: if Alice knows which states she is sending
to Bob, then she can code each state differently and we call this arbitrary or nonblind coding;
if she doesn't know then she treats all states equally, which is called blind coding. Alice's
allowed operations, in the case of blind coding, are completely positive maps (represented by
superoperators). For arbitrary coding, she can use any map; she could even, if she wanted,
replace each code state $\rho_a$ by a pure state $|\psi_a\rangle$. Bob, on the other hand, is always constrained
to using a completely positive map, since he can have no knowledge of the state he receives.

A protocol is then defined as a sequence $(\Lambda_n, \$_n)$ of Alice's encoding operations $\Lambda_n$ and Bob's
decoding superoperators $\$_n$ which yield arbitrarily good fidelity as block size increases, i.e.

$$ F_n = \sum_{\mathbf{a}} q_{\mathbf{a}}\, F(\sigma_{\mathbf{a}}, \$_n[\Lambda_n(\sigma_{\mathbf{a}})]) \tag{3.5} $$

satisfies $F_n \to 1$ as $n \to \infty$. The rate of a protocol $\mathcal{P}$ is

$$ R_{\mathcal{P}} = \lim_{n \to \infty} \frac{1}{n} \log \dim \rho, \tag{3.6} $$

where $\dim \rho$ is the dimension of the support of $\rho$.

Horodecki then distinguishes between the passive information $I_p = \inf_{\mathcal{P}} R_{\mathcal{P}}$, where the infimum
is taken over all blind protocols (where $\Lambda_n$ is a superoperator), and the effective information
$I_e$, which is defined similarly but with the infimum taken over arbitrary protocols (where $\Lambda_n$ is
an arbitrary map). Clearly $I_e \le I_p$ since all blind protocols are also arbitrary. Horodecki
[42] shows that the effective information is bounded below by the Holevo information, $\chi(\mathcal{E}) \le I_e$,
and notes that the bound can be achieved if the ensemble $\mathcal{E}$ of code system states has
asymptotically vanishing mean entropy (the code states become almost pure). We may also
consider the "information defect" $I_d = I_p - I_e$, which characterises the stinginess of the ensemble:
a nonzero $I_d$ indicates that more resources are needed to convey the quantum information than
the actual value of the information. It is not known if the information defect is ever nonzero.

A tantalising clue is given in Schumacher's original paper where he showed that entanglement 
could be compressed using his protocol. Suppose Alice and Charlie share an EPR pair between 
them; for definiteness, suppose each of them holds an electron and the joint state of the electrons 
is 

$$ |\psi^-\rangle = \frac{1}{\sqrt{2}} \left( |\uparrow\rangle_A |\downarrow\rangle_C - |\downarrow\rangle_A |\uparrow\rangle_C \right) \tag{3.7} $$

($|\uparrow\rangle$ and $|\downarrow\rangle$ refer to spin polarisation states of the electrons). It is well-known that such a
state exhibits strong correlations (see e.g. Bell [?]). The question is, do such correlations
persist between Bob and Charlie if Alice conveys her system to Bob using the methods above?
Schumacher showed that they do. We are in effect conveying the full pure state $|\psi^-\rangle$ to Bob,
but only compressing one half of it — Charlie's half undergoes the identity transformation. This 
can be done with arbitrarily good fidelity, so the correlations between Bob and Charlie will be 
arbitrarily close to the original EPR correlations. 

Horodecki [?] then showed that in full generality, the problem of arbitrary (nonblind)
compression of an ensemble of mixed states could be reduced to minimising the entropy of any
extension ensemble. An extension of a density operator $\rho$ is a density operator $\sigma$ over a larger
Hilbert space with the property that $\rho = \mathrm{Tr}_{\text{ancilla}}\, \sigma$, and the extension of an ensemble is another
ensemble with the same probabilities whose elements are extensions of the original. In
particular, it seems fruitful to investigate the purifications of a given ensemble, since any purification
is also an extension. We would then be able to use the provably optimal techniques of pure
state coding to solve the problem. This is an extremely handy reduction, enabling us to ignore the
unwieldy representation of protocols and focus on states and their purifications.

However, minimisation of von Neumann entropy $S$ over all extensions is also a very difficult
problem. A first, and counter-intuitive, result was that of Jozsa and Schlienz [34] mentioned in
Chapter 2. Suppose we start with a particular purification of $\mathcal{E}$ and ask how we can distort the
states, leaving the probabilities fixed, to decrease $S$. A reasonable first guess, based on our
knowledge of two-state ensembles, is that any distortion which increases pairwise overlap will
decrease von Neumann entropy. As discussed previously, Jozsa and Schlienz showed that the
von Neumann entropy of almost any pure state ensemble in more than two dimensions can be
increased by making all the states more parallel. The compressibility and the distinguishability
are global properties of the ensemble which cannot be reduced to accumulated local properties
of members of the ensemble.

There are still many unanswered questions in this area, and many tantalising clues. Several of
these clues, together with some partial results, have been collected together by Barnum, Caves,
Fuchs, Jozsa and Schumacher [74]. One of their most striking insights is that the fidelity between
two strings of states depends on whether we are content to compute the fidelity of
each state as it arrives (LOCAL-FID) or whether we compute the fidelity of blocks of states
(GLOBAL-FID). The latter is a stronger condition than (LOCAL-FID) since it requires not only
that the individual states be very similar but also that their entanglements not be altered. For
pure states the two types of fidelity are the same, but for mixed states Barnum et al exhibit an
example with high (LOCAL-FID) but zero (GLOBAL-FID).

An important bound proved by Barnum et al [74] is that the Holevo information is a lower
bound for compression under the criterion (GLOBAL-FID). There are strong reasons
to believe that under this fidelity criterion the bound is not generally attainable, although it has been
shown that for a weaker fidelity this bound can always be achieved [75]. At the moment there
appear to be too many trees for us to see the wood in mixed state compression.

3.2 Entanglement 

Entanglement is one of the most intriguing aspects of quantum theory. First labeled
Verschränkung by Schrödinger [76], entanglement became one of the features most uncomfortable
to Einstein. It was the paper by Einstein, Podolsky and Rosen [77] (describing the famous EPR
paradox) which expressed some of this dissatisfaction with the "action at a distance" of quantum
mechanics. The crucial blow to Einstein's requirement of "local reality" came from John Bell in
1964 [?], and a version of Bell's theorem will be mentioned below.

First, however, we look at what entanglement is; and we begin by looking at the canonical 
example of entangled qubits. Suppose we are considering two electrons A and B; then a basis 
for the electron spin states is 

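$$ \begin{aligned} |\phi^+\rangle &= \frac{1}{\sqrt{2}} \left( |\uparrow\uparrow\rangle + |\downarrow\downarrow\rangle \right) \\ |\psi^+\rangle &= \frac{1}{\sqrt{2}} \left( |\uparrow\downarrow\rangle + |\downarrow\uparrow\rangle \right) \\ |\phi^-\rangle &= \frac{1}{\sqrt{2}} \left( |\uparrow\uparrow\rangle - |\downarrow\downarrow\rangle \right) \\ |\psi^-\rangle &= \frac{1}{\sqrt{2}} \left( |\uparrow\downarrow\rangle - |\downarrow\uparrow\rangle \right) \end{aligned} $$

where, for example, we denote the state $|\uparrow\rangle_A \otimes |\downarrow\rangle_B$ by $|\uparrow\downarrow\rangle$. These states form the Bell basis
(or alternatively they are EPR pairs), and they have numerous interesting properties that will
be briefly discussed here.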

A bipartite system in a Bell state has a peculiar lack of identity. If we consider just one half 
of the system and trace out the other system's variables, we find 

$$ \mathrm{Tr}_A |\Psi\rangle\langle\Psi| = {}_A\langle\uparrow|\Psi\rangle\langle\Psi|\uparrow\rangle_A + {}_A\langle\downarrow|\Psi\rangle\langle\Psi|\downarrow\rangle_A \tag{3.8} $$
$$ = \tfrac{1}{2} \mathbb{1} = \mathrm{Tr}_B |\Psi\rangle\langle\Psi| \tag{3.9} $$

(where $|\Psi\rangle$ is one of the Bell states): so any orthogonal measurement on a single particle will
yield a completely random outcome. Of course neither half of the system, considered alone, is in
a pure state, so despite the fact that we have maximal information about both systems, there is
no test we can perform on either one which will yield a definite outcome. Each of the basis states
represents (optimally) two bits of information, which we may conveniently call the amplitude bit
(the $|\psi\rangle$ states both have amplitude 0, the $|\phi\rangle$ states 1) and the phase bit (the $+$ states have phase
0, the $-$ states have phase 1). The amplitude bit is given by the eigenvalue of the observable $\sigma_z^A \sigma_z^B$ and
the phase bit by that of the observable $\sigma_x^A \sigma_x^B$; these are commuting operators. The problem in
determining their identity comes when the systems A and B are spatially separated, because the
local operators $\sigma_z^A$ and $\sigma_z^B$ which could be used to discover the amplitude bit do not commute
with the phase operator $\sigma_x^A \sigma_x^B$. Determining the amplitude bit through local measurements will
disturb the phase bit; so the information is practically inaccessible. If the particles are brought
together, we can measure the amplitude bit without determining the values of $\sigma_z$ individually,
and hence fully determine the identity of the state.
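
The "lack of identity" of one half of a Bell pair can be verified directly. The sketch below is a minimal illustration, assuming a two-qubit pure state is stored as a list of four amplitudes ordered as $|\uparrow\uparrow\rangle, |\uparrow\downarrow\rangle, |\downarrow\uparrow\rangle, |\downarrow\downarrow\rangle$ (subsystem A first); the function name `reduced_state_A` and the dictionary of states are hypothetical.

```python
import math

def reduced_state_A(amps):
    """Return the 2x2 matrix Tr_B |psi><psi| for a two-qubit pure state.

    amps[2*a + b] is the amplitude of |a>_A |b>_B, with 0 = up and 1 = down.
    """
    return [
        [sum(amps[2 * i + b] * amps[2 * j + b].conjugate() for b in range(2))
         for j in range(2)]
        for i in range(2)
    ]

S = 1.0 / math.sqrt(2.0)
BELL_STATES = {
    "phi+": [S, 0.0, 0.0, S],
    "phi-": [S, 0.0, 0.0, -S],
    "psi+": [0.0, S, S, 0.0],
    "psi-": [0.0, S, -S, 0.0],
}

for name, amps in BELL_STATES.items():
    print(name, reduced_state_A(amps))

# For comparison, a product state leaves subsystem A in a pure state.
print("up-down", reduced_state_A([0.0, 1.0, 0.0, 0.0]))
```

Each Bell state gives the same maximally mixed reduced matrix, so no local measurement on A alone can tell the four states apart.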

So: what makes these states entangled? And what other states are entangled? The answer 
to these questions is rather an answer in the negative: a pure state is not entangled (i.e. it is 
separated) if it can be written as a product state over its constituent subsystems (we will mention 
mixed state entanglement later). An alert student may protest that by changing bases we could 
change a state from entangled to separated - so we should be sure of our facts first. 

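Suppose $|\Psi\rangle$ is a pure state over two subsystems A and B (that is, it is a vector from $H_A \otimes H_B$).
Then let $\{|i\rangle_A\}$ be the basis for $H_A$ in which the reduced density matrix for subsystem A is
diagonal:

$$ \rho_A = \mathrm{Tr}_B |\Psi\rangle\langle\Psi| = \sum_i \lambda_i |i\rangle_A\langle i|. \tag{3.10} $$

Of course if $\{|\alpha\rangle_B\}$ is any basis for $H_B$, then we can write

$$ |\Psi\rangle = \sum_{i,\alpha} c_{i,\alpha} |i\rangle_A |\alpha\rangle_B = \sum_i |i\rangle_A |\tilde{i}\rangle_B \tag{3.11} $$

where we have defined $|\tilde{i}\rangle_B = \sum_\alpha c_{i,\alpha} |\alpha\rangle_B$. The $|\tilde{i}\rangle_B$ do not necessarily form an orthonormal basis,
but we can calculate A's reduced density matrix in terms of them:

$$ |\Psi\rangle\langle\Psi| = \sum_{i,j} |i\rangle_A\langle j| \otimes |\tilde{i}\rangle_B\langle\tilde{j}| \tag{3.12} $$
$$ \Rightarrow \rho_A = \sum_k \sum_{i,j} \langle k|\tilde{i}\rangle\langle\tilde{j}|k\rangle\, |i\rangle_A\langle j| \tag{3.13} $$
$$ = \sum_{i,j} \langle\tilde{j}|\tilde{i}\rangle\, |i\rangle_A\langle j| \tag{3.14} $$

where $\{|k\rangle\}$ is any orthonormal basis for $H_B$. Comparing this with Eqn 3.10, we see that

$$ \langle\tilde{j}|\tilde{i}\rangle = \lambda_i \delta_{ij}: \tag{3.16} $$

the $|\tilde{i}\rangle_B$ are orthogonal! Defining $\sqrt{\lambda_i}\, |i'\rangle_B = |\tilde{i}\rangle_B$ (so that the $|i'\rangle_B$ form an orthonormal basis) we find that

$$ |\Psi\rangle = \sum_i \sqrt{\lambda_i}\, |i\rangle_A \otimes |i'\rangle_B. \tag{3.17} $$

Footnote: The operators $\sigma_i$ are the Pauli matrices, namely $\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, $\sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}$, $\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$.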

i 

This is called the Schmidt decomposition of the bipartite state $|\Psi\rangle$. Notice that this
decomposition is unique (up to degeneracies among the $\lambda_i$), and tells us also that the reduced density matrices $\rho_A$ and $\rho_B$ have the same
non-zero eigenvalue spectrum; if $H_A$ and $H_B$ have different dimensions, then the remaining
eigenvalues are zero. The Schmidt decomposition is related to a standard result in linear algebra,
known as the singular value decomposition.

We now have a very simple method of defining whether a pure state is entangled. Define the
Schmidt number $N_S$ to be the number of non-zero terms in the expansion Eqn 3.17. Then the
state is entangled if $N_S > 1$. The special features of such entangled states are related to Bell's
theorem.



Bell's Theorem Einstein and others thought that, although quantum physics appeared to be 
a highly accurate theory, there was something missing. One proposed way of remedying this 
problem was to introduce hidden variables; variables lying at a deeper level than the Hilbert 
space structure of quantum mechanics, and to which quantum theory would be a statistical 
approximation. The idea is then that quantum mechanics describes a sort of averaging over 
these unknown variables, in much the same way as thermodynamics is obtained from statistical 
mechanics. In fact, the variables could even in principle be inaccessible to experiment: the 
important feature is that these variables would form a complete description of reality. Of course, 
to be palatable we should require these variables to be constrained by locality, to make the 
resulting theory "local realist". In the words of Einstein p. 85]: 

But on one supposition we should, in my opinion, absolutely hold fast: the real factual 
situation of the system $S_2$ is independent of what is done with the system $S_1$, which
is spatially separated from the former.

Einstein was convinced that the world was local, deterministic and real. 

John Bell struck a fatal blow to this view. Bell considered a very simple inequality which 
any local deterministic theory must obey — an inequality that was a miracle of brevity and could 
be explained to a high school student, and was indeed known to the 19th century logician Boole
— and he showed that quantum mechanics violates this inequality. We are thus forced to one 
of several possible conclusions: (i) quantum mechanics gives an incorrect prediction, and a "Bell- 
type" experiment will demonstrate this; (ii) quantum mechanics is correct, and any deterministic 
hidden variable theory replicating the predictions of quantum mechanics must be non-local. With 
the overwhelming experimental evidence supporting quantum mechanics, it would be a foolhardy 
punter who put his money on option (i), and indeed several Bell-type experiments performed to
date have agreed with the quantum predictions [?]. Bell's theorem refers to the conclusion that
any deterministic hidden variable theory that reproduces quantum mechanical predictions must 
be nonlocal. 

An excellent discussion of Bell's theorem and its implications is found in Ch. 6]. Several 
papers have since appeared on "Bell's theorem without inequalities" (for a readable discussion, 
see I^H), which eliminate the statistical element of the argument and make Bell's theorem purely 
logical. In this case a single outcome of a particular quantum mechanical experiment is enough 
to render a hidden variable description impossible. 

Bipartite means that the state is a joint state over two systems. 
The importance of entangled states, as defined above, is that every entangled pure state
violates a Bell inequality [?]. However, the definition of entangled states given above is not
very helpful when we need to compare the degree of entanglement of bipartite states. Fortunately,
we have all we need to find a quantitative measure of entanglement: a probability distribution,
given by the eigenvalues $\lambda_i$. We define the entanglement of a bipartite state $|\Psi\rangle$ to be

$$ E(|\Psi\rangle) = H(\lambda_i) \tag{3.18} $$
$$ = S(\rho_A) = S(\rho_B) \tag{3.19} $$

where $H$ is the familiar Shannon entropy. Happily, the entanglement of a state with $N_S = 1$
is zero. The maximally entangled states are those of the form $\frac{1}{\sqrt{N}} \sum_{i=1}^{N} |i\rangle \otimes |i'\rangle$, which are equal
superpositions of orthogonal product states. The Bell state basis consists of maximally
entangled states, and these are (up to local unitary transformations) the only maximally entangled states of two qubits. In analogy with
the primitive notion of a qubit being a measure of quantum information, we define an ebit to
be the amount of entanglement shared between systems in a Bell state.

Bennett et al [^] point out that this entanglement measure has the pleasing properties that
(1) the entanglement of independent systems is additive and (2) the amount of entanglement
is left unchanged under local unitary transformations, that is, those that can be expressed as
products: $U = U_A \otimes U_B$. The entanglement cannot be increased by more general operations
either (see below), and has a convenient interpretation that one system with entanglement
$E = E(|\psi\rangle)$ is completely equivalent (in a sense to be made clear later) to E maximally
entangled pairs of qubits.
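As a small numerical illustration of Eqns 3.18 and 3.19, the following Python sketch computes the
entanglement from a list of Schmidt eigenvalues. The function name `entanglement` and the example
states are illustrative choices, not part of the original text, and logarithms are taken to base 2
on the assumption that entanglement is counted in ebits.

```python
import math

def entanglement(lambdas):
    """Shannon entropy (in bits) of the Schmidt eigenvalues of a pure bipartite state."""
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("eigenvalues must sum to 1")
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log2(p) for p in lambdas if p > 0)

product_state = [1.0]      # a single nonzero eigenvalue: no entanglement
bell_state = [0.5, 0.5]    # equal superposition of two orthogonal product states
partial = [0.9, 0.1]       # a non-maximally entangled state

# For two independent pairs the joint eigenvalues are products of the individual ones,
# which is what makes the measure additive.
two_bell_pairs = [p * q for p in bell_state for q in bell_state]

for name, spectrum in [("product", product_state), ("Bell", bell_state),
                       ("partial", partial), ("two Bell pairs", two_bell_pairs)]:
    print(name, entanglement(spectrum))
```

The last example is a direct check of property (1): the entanglement of two independent Bell pairs
is the sum of the entanglements of the individual pairs.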



Mixed state entanglement. The case for "a measure of entanglement" of a bipartite mixed
state $\rho_{AB}$ is not quite as clear cut. Indeed, just about the only useful definition is
that of a separable state, which is one which can be written as

$$\rho_{AB} = \sum_i p_i \, \rho_A^i \otimes \rho_B^i. \tag{3.20}$$

We will return to measures of entanglement for mixed states later, once we have developed 
more practical notions of the uses of entanglement and an idea of what we are in fact hoping to 
quantify. For now, we turn our attention to these issues. 



3.2.1 Quantum Key Distribution (again) 

We have already seen a protocol for QKD in Section 2^. Here we describe a variant based 
on Bell states, due to Ekert |^3[ which illustrates the interchangeability of quantum information 
with entanglement. 

The protocol is illustrated in Figure 3.2. We assume Alice and Bob start off sharing a large
number n of Bell state pairs, which we will assume are in the singlet state $|\psi^-\rangle$.
These pairs could be generated by a central "quantum software" distributor and delivered to
Alice and Bob, or Alice could manufacture them and transmit one half of each pair to Bob. Once
again, Alice and Bob have written down strings of n digits, but this time the digits could be
0, 1 or 2. As before, 0, 1 and 2 correspond to different nonorthogonal measurement bases (these
bases are illustrated for spin-1/2 measurements in Figure 3.2). The outcome of any measurement
is either "+" or "-".

Any one of the Bell states can be written in this way by redefining the single-particle basis;
in this basis none of the remaining Bell states has this form.

Figure 3.2: Illustration of quantum key distribution using EPR pairs.

We denote by $P_{-+}(i,j)$ the probability that Alice gets a "-" when measuring in the $i$ basis
at the same time as Bob finds "+" in the $j$ basis. The correlation coefficient between two
measurement bases is

$$E(i,j) = P_{++}(i,j) + P_{--}(i,j) - P_{+-}(i,j) - P_{-+}(i,j). \tag{3.21}$$

If we do the calculations we find for our choice of bases

$$E(i,j) = -\mathbf{a}_i \cdot \mathbf{b}_j, \tag{3.22}$$

where $\mathbf{a}_i$ and $\mathbf{b}_j$ are the direction vectors of the "+" outcomes of Alice's
$i$ and Bob's $j$ basis respectively. (For example, $\mathbf{a}_0 = (0, 1)$ and
$\mathbf{b}_0 = (\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$.) So we define the quantity

$$S = E(0,0) - E(0,2) + E(2,0) + E(2,2) \tag{3.23}$$

and calculate that for the bases illustrated in Figure 3.2, $S = -2\sqrt{2}$. The quantity $S$ is
a generalisation, due to Clauser, Horne, Shimony and Holt [84], of the correlation function
originally used by Bell.

Once Alice and Bob have made all their measurements on the EPR pairs, they announce
their measurement bases (i.e. their random strings of 0, 1 and 2). The only cases in which their
measurement bases will agree are (i) Alice measured in basis 1 and Bob in basis 0; (ii) Alice
measured in basis 2 and Bob in basis 1. They will thus know that when these measurements were
made they got anti-correlated results, and they can use this to generate a secret key. And how
can they tell if someone was eavesdropping? This is where the quantity S comes in. Notice that S
is defined only for cases where the measurement bases were different. Bob can publicly announce
all his measurement results for when both their bases were either 0 or 2, and Alice can use this
to calculate the correlation functions and hence S. Since S is bounded by
$-2\sqrt{2} \le S \le 2\sqrt{2}$, and undisturbed singlet pairs attain the lower bound, any
interference by an eavesdropper will reduce the (anti)correlation between Alice and Bob's
measurements and hence be statistically detectable (although this is quite difficult to prove; see
ijSSll ). In this case Alice will abort the key distribution.
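The value of S quoted above can be checked with a few lines of Python. This is only a sketch: the
analyser angles below are an assumption (the usual choice in Ekert's scheme), picked to agree with
the direction vectors $\mathbf{a}_0$ and $\mathbf{b}_0$ given in the text and with the two pairs of
coinciding bases; the function names are ours.

```python
import math

def direction(theta):
    """Unit vector at angle theta from the vertical axis, so theta = 0 gives (0, 1)."""
    return (math.sin(theta), math.cos(theta))

# Assumed analyser angles: Alice uses 0, pi/4, pi/2 and Bob uses pi/4, pi/2, 3pi/4.
# This reproduces a_0 = (0, 1), b_0 = (1/sqrt 2, 1/sqrt 2), a_1 = b_0 and a_2 = b_1.
alice = [direction(k * math.pi / 4) for k in (0, 1, 2)]
bob = [direction(k * math.pi / 4) for k in (1, 2, 3)]

def correlation(i, j):
    """E(i, j) = -a_i . b_j for measurements on the singlet state (Eqn 3.22)."""
    (ax, ay), (bx, by) = alice[i], bob[j]
    return -(ax * bx + ay * by)

def chsh():
    """The quantity S of Eqn 3.23."""
    return correlation(0, 0) - correlation(0, 2) + correlation(2, 0) + correlation(2, 2)

S = chsh()
print("S =", S, " compare -2*sqrt(2) =", -2 * math.sqrt(2))
print("key bases:", correlation(1, 0), correlation(2, 1))
```

The two "key" correlations are those of the coinciding bases, where the outcomes are perfectly
anti-correlated; the remaining combinations feed the eavesdropping test.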

Clauser et al showed that a local deterministic theory must have $|S| \le 2$. Thus this particular
measurement outcome, if it occurs, cannot be replicated by a local deterministic theory.

A useful aspect of this protocol is that the "element of reality" from which the secret key is
constructed does not come into existence until Alice and Bob make their measurements. This
contrasts with the BB84 scheme of Section 2^, in which Alice is required to impose a known
state onto the quantum system she sends to Bob. Practically, this means that the BB84 protocol
is vulnerable to someone breaking into Alice's safe and (without being detected by Alice) stealing
the preparation basis information, in which case the eavesdropper simply needs to wait for
Alice and Bob's measurement basis announcement and can calculate the secret key. In Ekert's
protocol Alice and Bob can hold onto their EPR pairs until they need a key, then use the protocol,
send the message and destroy the key all within a short time.



3.2.2 Quantum Superdense Coding 

Quantum information can also be used to send classical information more efficiently. This is 
known as quantum superdense coding, and is due to Bennett and Wiesner [^ ]. 

Suppose Alice wanted to send two bits of classical information to Bob. One way she could do
this is by sending him two electrons with information encoded in their polarisation states and,
as we've seen already, two bits is the maximum amount of classical information she can send in
this way. One method of doing this would be, for example, to send one of the states
$|\uparrow\uparrow\rangle$, $|\uparrow\downarrow\rangle$, $|\downarrow\uparrow\rangle$ or
$|\downarrow\downarrow\rangle$, each with probability 1/4.

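Suppose instead that Alice and Bob share the Bell state $|\phi^+\rangle$. Define the following
local actions which Alice can perform on her side of the pair: $U_0 = 1 \otimes 1$,
$U_1 = \sigma_x \otimes 1$, $U_2 = \sigma_y \otimes 1$ and $U_3 = \sigma_z \otimes 1$. The effects
of the operations on the shared EPR pair can be calculated:

$$U_0|\phi^+\rangle = |\phi^+\rangle, \quad U_1|\phi^+\rangle = |\psi^+\rangle, \quad U_2|\phi^+\rangle = -i|\psi^-\rangle, \quad U_3|\phi^+\rangle = |\phi^-\rangle.$$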

A method for sending information suggests itself: Alice and Bob take out their old EPR pairs,
which they've had for years now, and Alice performs one of these operations on her side of the
pair. She then sends her one electron to Bob, who makes the measurements
$\sigma_x \otimes \sigma_x$ and $\sigma_z \otimes \sigma_z$ on the two electrons now in his
possession. As mentioned previously (Section ^^ ), all the Bell states are simultaneous
eigenstates of these operators corresponding to different eigenvalues, so these measurements
completely identify the state. From this outcome he can infer which of the four local operations
Alice used. He has received two bits of information despite the fact that Alice only sent him one
two-level system!
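The encoding step can be verified directly. The sketch below, in plain Python, applies each of
Alice's four local operations to $|\phi^+\rangle$ and identifies the resulting Bell state by its
overlap, standing in for Bob's joint measurement. The amplitude ordering, the Bell-state
conventions $|\phi^\pm\rangle \propto |\uparrow\uparrow\rangle \pm |\downarrow\downarrow\rangle$,
$|\psi^\pm\rangle \propto |\uparrow\downarrow\rangle \pm |\downarrow\uparrow\rangle$, and all
function names are assumptions made for this illustration.

```python
import math

R = 1 / math.sqrt(2)
# Amplitudes ordered as (up-up, up-down, down-up, down-down); Alice's spin is listed first.
BELL = {
    "phi+": [R, 0, 0, R],
    "phi-": [R, 0, 0, -R],
    "psi+": [0, R, R, 0],
    "psi-": [0, R, -R, 0],
}
PAULI = {
    "1": [[1, 0], [0, 1]],
    "x": [[0, 1], [1, 0]],
    "y": [[0, -1j], [1j, 0]],
    "z": [[1, 0], [0, -1]],
}

def apply_on_alice(u, state):
    """Apply the 2x2 matrix u to Alice's qubit of a two-qubit state."""
    out = [0j] * 4
    for a_new in range(2):
        for a_old in range(2):
            for b in range(2):
                out[2 * a_new + b] += u[a_new][a_old] * state[2 * a_old + b]
    return out

def overlap(s, t):
    """Inner product <s|t>."""
    return sum(x.conjugate() * y for x, y in zip(s, t))

def decode(state):
    """Stand-in for Bob's Bell measurement: name the Bell state matching the input."""
    for name, bell in BELL.items():
        if abs(abs(overlap(bell, state)) - 1) < 1e-12:
            return name
    raise ValueError("not a Bell state")

message_to_outcome = {m: decode(apply_on_alice(u, BELL["phi+"])) for m, u in PAULI.items()}
print(message_to_outcome)
```

Because the four operations lead to four mutually orthogonal states, Bob's measurement
distinguishes them with certainty, which is where the two bits come from.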

One might argue that this still requires the transmission of both halves of the EPR pair and so 
we don't really gain anything — we send two two-level systems and get two bits of information. 
The important point here is that the shared EPR pair is a prior resource which exists before the 
classical information needed to be communicated. Once Alice knew the outcome of the soccer 
match she wanted to report to Bob, she only needed to send him one qubit. 

What is the maximum "compression factor" possible? And what other states can be used for
superdense coding? Hausladen et al [^] showed that any pure entangled state can be used for
superdense coding, and that the maximum information that can be superencoded into a bipartite
state $|\psi_{AB}\rangle$ is $E(|\psi_{AB}\rangle) + \log N$, where $N$ is the dimension of the
system Alice sends to Bob. Clearly by sending one $N$-state system Alice can communicate $\log N$
bits of information; the excess from superdense coding is exactly equal to the entanglement of the
state.
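As a quick check of this formula (assuming logarithms to base 2, so that information is counted in
bits), consider the two extreme cases for a pair of qubits, $N = 2$:

$$\text{Bell state: } E + \log_2 N = 1 + 1 = 2 \text{ bits}, \qquad \text{product state: } E + \log_2 N = 0 + 1 = 1 \text{ bit}.$$

The first is the protocol described above; the second is the ordinary transmission of a single
two-level system with no prior entanglement.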



3.2.3 Quantum Teleportation 

What is so special about collective measurements, and can we draw a line between what is
possible with separate and with joint measurements? As mentioned earlier, no local (separated)
measurements on a system in one of the Bell states can identify which Bell state it is. Similarly,
Peres and Wootters showed an explicit situation in which the accessible information from a
set of states appeared to be higher using a joint measurement than using any set of separated
measurements. In an attempt "to identify what other resource, besides actually being in the
same place, would enable Alice and Bob to make an optimal measurement" [87, p. 1071] on a
bipartite system, Bennett et al stumbled across the idea of quantum teleportation.

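We suppose, as before, that Alice and Bob share an EPR pair which we suppose to be the
state $|\phi^+\rangle_{AB}$. Alice is also holding a quantum state
$|\mu\rangle_C = a|\uparrow\rangle_C + b|\downarrow\rangle_C$ of another qubit which we call
system C. Alice might not know a and b, and measuring the state will destroy the quantum
information. However, she notices something beautiful about the state of the three particles A, B
and C considered together:

$$
\begin{aligned}
|\phi^+\rangle_{AB}|\mu\rangle_C &= \frac{1}{\sqrt{2}}\left(a|\uparrow_A\uparrow_B\uparrow_C\rangle + a|\downarrow_A\downarrow_B\uparrow_C\rangle + b|\uparrow_A\uparrow_B\downarrow_C\rangle + b|\downarrow_A\downarrow_B\downarrow_C\rangle\right) \\
&= \frac{1}{2}\Big[\, |\phi^+\rangle_{AC}\left(a|\uparrow\rangle_B + b|\downarrow\rangle_B\right) \\
&\qquad + |\phi^-\rangle_{AC}\left(a|\uparrow\rangle_B - b|\downarrow\rangle_B\right) \\
&\qquad + |\psi^+\rangle_{AC}\left(b|\uparrow\rangle_B + a|\downarrow\rangle_B\right) \\
&\qquad + |\psi^-\rangle_{AC}\left(b|\uparrow\rangle_B - a|\downarrow\rangle_B\right)\Big].
\end{aligned}
\tag{3.24}
$$

Thus if Alice performs a Bell basis measurement on the system AC, she projects the system into
a state represented by one of the terms in Eqn 3.24. In particular, Bob's half of the EPR pair
will be projected into something that looks similar to the original state $|\mu\rangle_C$. In
fact, by applying one of the operators $1$, $\sigma_x$, $\sigma_y$, $\sigma_z$, Bob can rotate his
qubit to be exactly $|\mu\rangle$ (up to an irrelevant overall phase)!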

The protocol runs as follows: Alice measures the system AC in the Bell basis, gets one
of four possible outcomes, and communicates this outcome to Bob. Once Bob knows the result
of Alice's measurement (and not before), he performs the appropriate operation on his
qubit and ends up with the state $|\mu\rangle$.
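The protocol can be simulated in a few lines of plain Python. The sketch below projects systems A
and C onto each Bell state in turn, applies the corresponding correction to Bob's qubit, and
compares the result with the input state. The table of corrections is read off from Eqn 3.24; the
data layout, the test state and the function names are assumptions made for this illustration.

```python
import math

R = 1 / math.sqrt(2)
# Bell states as amplitude tables indexed [first qubit][second qubit]; 0 = up, 1 = down.
BELL = {
    "phi+": [[R, 0], [0, R]],
    "phi-": [[R, 0], [0, -R]],
    "psi+": [[0, R], [R, 0]],
    "psi-": [[0, R], [-R, 0]],
}
# Bob's correction for each of Alice's outcomes, read off from Eqn 3.24.
CORRECTION = {
    "phi+": [[1, 0], [0, 1]],      # identity
    "phi-": [[1, 0], [0, -1]],     # sigma_z
    "psi+": [[0, 1], [1, 0]],      # sigma_x
    "psi-": [[0, -1j], [1j, 0]],   # sigma_y
}

def teleport(mu):
    """Return {outcome: (probability, Bob's corrected state)} for the input qubit mu = (a, b)."""
    results = {}
    for outcome, bell in BELL.items():
        # The joint amplitude of (A, B, C) is phi+[a][b] * mu[c]; project AC onto <bell|.
        bob = [
            sum(bell[a][c].conjugate() * BELL["phi+"][a][b] * mu[c]
                for a in range(2) for c in range(2))
            for b in range(2)
        ]
        prob = sum(abs(x) ** 2 for x in bob)
        norm = math.sqrt(prob)
        bob = [x / norm for x in bob]
        u = CORRECTION[outcome]
        fixed = [sum(u[r][k] * bob[k] for k in range(2)) for r in range(2)]
        results[outcome] = (prob, fixed)
    return results

def fidelity(s, t):
    """|<s|t>|^2, which is insensitive to an overall phase."""
    return abs(sum(complex(x).conjugate() * y for x, y in zip(s, t))) ** 2

mu = (0.6, 0.8j)  # an arbitrary normalised test state
for outcome, (prob, state) in teleport(mu).items():
    print(outcome, round(prob, 6), round(fidelity(mu, state), 6))
```

Each outcome occurs with the same probability whatever the values of a and b, so Alice's
measurement result on its own reveals nothing about the teleported state.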

There are many remarkable features to this method of communication. Firstly, as long as Alice 
and Bob can reliably store separated EPR particles and can exchange classical information, they 
can perform teleportation. In particular, there can be all manner of "noise" between them 
(classical information can be made robust to this noise) and an arbitrarily large distance. Alice 
doesn't even need to know where Bob is, as she would if she wished to send him a qubit directly 
— she merely needs access to a broadcasting channel. 

Secondly, the speed of transmission of the state of a massive particle is limited only by the 
speed of classical communication — which in turn is limited only by the speed of light. This is 
an intriguing subversion of the idea that massive particles must travel slower than light speed, 
although of course, only the state is transmitted. We could even imagine a qubit realised as an 
electron in system C and as a photon in system B, in which case we teleport a massive particle 
onto a massless particle. Quantum information is a highly interchangeable resource! 

Thirdly, we have developed a hierarchy of information resources. The lowest resource is a
classical bit, which cannot generally be used to share quantum information and which cannot be
used to create entanglement (which is a consequence of Bell's theorem). The highest resource is a
qubit, and an ebit is intermediate. This classification follows from the fact that if Alice and Bob
can reliably communicate qubits, they can create an ebit of entanglement (Alice simply creates an
EPR pair and transmits half to Bob); whereas if they share an ebit they further require classical
information to be able to transmit qubits.

Much work has been performed in attempting to implement quantum teleportation in the
laboratory, with varying claims to success. Some of the techniques used are cavity QED [88],
parametric down-conversion of laser light [^] and NMR [90]. Experiments are however fraught
with complications, both practical and theoretical; see |^l| and references therein.

Figure 3.3: A Turing machine.



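Entanglement swapping. Quantum teleportation can also be used to swap entanglement [92].
Suppose Alice and Bob share the EPR pair $|\phi^+\rangle_{AB}$ and Charlie and Doris share
another, $|\phi^+\rangle_{CD}$. Then the state they have can be rewritten as

$$
\begin{aligned}
|\phi^+\rangle_{AB}|\phi^+\rangle_{CD} &= \frac{1}{2}\left(|\uparrow_A\uparrow_B\uparrow_C\uparrow_D\rangle + |\uparrow_A\uparrow_B\downarrow_C\downarrow_D\rangle + |\downarrow_A\downarrow_B\uparrow_C\uparrow_D\rangle + |\downarrow_A\downarrow_B\downarrow_C\downarrow_D\rangle\right) \\
&= \frac{1}{2}\Big(|\phi^+\rangle_{AD}|\phi^+\rangle_{BC} + |\phi^-\rangle_{AD}|\phi^-\rangle_{BC} \\
&\qquad + |\psi^+\rangle_{AD}|\psi^+\rangle_{BC} + |\psi^-\rangle_{AD}|\psi^-\rangle_{BC}\Big).
\end{aligned}
\tag{3.25}
$$

So if Bob and Charlie get together and make a Bell basis measurement on their qubits, they
will get one of the four outcomes with equal probability. Once they communicate this classical
information to Alice and Doris, the latter will know that they are holding an EPR pair, and
Alice can convert it to $|\phi^+\rangle$ by a local rotation.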

Of course, all that Bob has done is to teleport his entangled state onto Doris' system, so 
entanglement swapping is not much different to straightforward teleportation. But entanglement 
swapping will be important later when we discuss quantum channels. 
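The decomposition in Eqn 3.25 can be confirmed numerically. The Python sketch below projects
qubits B and C of $|\phi^+\rangle_{AB}|\phi^+\rangle_{CD}$ onto each Bell state and checks which
state is left on A and D. The amplitude-table layout and the function names are assumptions made
for the illustration; all amplitudes are real here, so no complex conjugation is needed.

```python
import math

R = 1 / math.sqrt(2)
# Bell states as amplitude tables indexed [first qubit][second qubit]; 0 = up, 1 = down.
BELL = {
    "phi+": [[R, 0], [0, R]],
    "phi-": [[R, 0], [0, -R]],
    "psi+": [[0, R], [R, 0]],
    "psi-": [[0, R], [-R, 0]],
}

def swap(outcome):
    """Project B and C onto the named Bell state; return (probability, state of A and D)."""
    phi = BELL["phi+"]
    bell = BELL[outcome]
    ad = [[sum(bell[b][c] * phi[a][b] * phi[c][d] for b in range(2) for c in range(2))
           for d in range(2)] for a in range(2)]
    prob = sum(x ** 2 for row in ad for x in row)
    norm = math.sqrt(prob)
    return prob, [[x / norm for x in row] for row in ad]

def overlap(s, t):
    """Inner product of two real two-qubit amplitude tables."""
    return sum(s[i][j] * t[i][j] for i in range(2) for j in range(2))

for name in BELL:
    prob, state = swap(name)
    print(name, round(prob, 6), round(overlap(state, BELL[name]), 6))
```

In every case the pair AD ends up in the same Bell state that Bob and Charlie found, which is
exactly the pairing of terms in Eqn 3.25.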

3.3 Quantum Computing 

The theory of computing dates from work by Turing and others in the 1930s. However, the
idea of a computer, as enunciated by Turing, suffered from the same shortcoming as classical
information: there was no physical basis for the model. And there were surprises in store for
those who eventually did investigate the physical nature of computation.

Turing wanted to formalise the notion of a "computation" as it might be carried out by a
mathematician in such a way that a machine could carry out all the necessary actions. His model
was a mathematician sitting in a room with access to an infinite amount of paper, some of which
contains the (finite) description of the problem he is to solve. The mathematician is supposed to
be in one of a finite number of definite states, and he changes state only after he reads part of
the problem statement; as he changes state he is allowed to write on the paper. A Turing
machine (illustrated in Figure 3.3) is the machine version of this mathematician. The Turing
machine (TM) has a finite number of internal states, here labelled a, b, ..., l; a "read" head; and
a "write" head. An infinite strip of paper, with discrete cells on it, runs between the read and
write heads; each cell can either contain a 0, a 1 or a blank. Initially the paper is completely
blank except for a finite number of cells which contain the problem specification. A cycle of
the machine consists of reading the contents of the current cell, writing a new value into the
cell, changing state, and moving the tape left or right. For example, consider the instruction set
shown in the figure. The machine is currently in the state a and is reading the value 0; according
to the instructions given, the machine should write a 1, change into state g and shift the tape to
the left so that the read head sees a 1.
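The cycle described above is easy to mimic in software. The following Python sketch is a minimal
one-tape simulator; the instruction set `INVERT` is a hypothetical example (it is not the one in
Figure 3.3), and the function name, the blank symbol and the cycle limit are our own choices. The
sketch moves the head rather than the tape, which is equivalent: moving the head right corresponds
to shifting the tape left.

```python
def run_turing_machine(rules, tape, state="a", halt="halt", max_cycles=10_000):
    """Run a one-tape Turing machine; return (final tape contents, number of cycles).

    rules maps (state, symbol) -> (symbol to write, head move, next state),
    where the move is +1 (right) or -1 (left) and " " denotes a blank cell.
    """
    cells = dict(enumerate(tape))   # sparse tape: cells never written are blank
    position, cycles = 0, 0
    while state != halt:
        if cycles >= max_cycles:
            raise RuntimeError("cycle limit reached without halting")
        symbol = cells.get(position, " ")
        write, move, state = rules[(state, symbol)]
        cells[position] = write
        position += move
        cycles += 1
    low, high = min(cells), max(cells)
    return "".join(cells.get(i, " ") for i in range(low, high + 1)).strip(), cycles

# Hypothetical instruction set: invert every bit, then halt at the first blank cell.
INVERT = {
    ("a", "0"): ("1", +1, "a"),
    ("a", "1"): ("0", +1, "a"),
    ("a", " "): (" ", +1, "halt"),
}

result, cycles = run_turing_machine(INVERT, "0110")
print(result, cycles)
```

The returned cycle count is the quantity used to measure the time complexity of a computation; for
this toy machine it grows linearly with the length of the input.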

The Turing machine can be used to define the complexity of various problems. If we have
a TM which, say, squares any number it is given, then a reasonable measure of complexity is
how long (how many cycles) it takes for the Turing machine to halt with the answer. This of
course depends on the size of the input, since it's a lot harder to square 1564 than 2. If the input
occupied n cells and the computation took $3n^2$ steps (asymptotically, for large n) we refer to
the computation as having complexity $n^2$, or more generally polynomial-time complexity. Of
course, the complexity of a problem is identified as the minimum complexity over all computations
which solve that problem. The two chief divisions are polynomial-time and exponential-time
problems.

Happily, complexity is relatively machine-independent: any other computational device (a
cellular automaton, or perhaps a network of logic gates like in a desktop PC) can be simulated
by a TM with at most polynomial slowdown. And in fact there exist universal Turing machines
that can simulate any other Turing machine with only a constant increase in complexity; the
universal TM just requires a "program" describing the other TM. Turing's theory is very elegant
and underpins most of modern computing.

However, we can also start to ask what other resources are required for computing. What is 
the space complexity of a problem — how many cells on the strip of paper does the computation 
need? Frequently there is a trade-off between space and time in a particular computation. More 
practically, we can ask what are the energy requirements of a computation [^]? What must 
the accuracy of the physical processes underlying computation be? Questions like these brought 
complexity theorists to the limits of classical reasoning, beyond which the ideas of quantum 
theory had to be employed for accurate answers. 

Feynman [^] first noted that simple quantum systems cannot be efficiently simulated by
Turing machines. This situation was turned on its head by Deutsch when he suggested that this is
not a problem but an opportunity; that the notion of efficient computing is not entirely captured
by classical systems. He took this idea and proposed first a quantum Turing machine (which
can exist in superpositions of states) and then the quantum computational network, which
is the most prevalent model for quantum computing today. Several problems were discovered
which could be solved faster on quantum computers (such as the Deutsch-Jozsa problem [96]),
but the "silver bullet" application was an algorithm for polynomial-time factorisation of integers
discovered by Shor [97] (for a readable account see [^]). The assumed exponential complexity
of factorisation is the basis for most of today's public-key encryption systems, so a polynomial
algorithm for cracking these cryptosystems generated a lot of popular interest.

There are many questions which arise about the usefulness of quantum computation. For 
example, how much accuracy is required? What types of coherent manipulations of quantum 
systems are necessary for completely general quantum computation? And which problems can 
be solved faster on a quantum computer as opposed to a classical computer? 

The first question here requires knowledge of quantum error correction techniques, and will
be discussed in Section 3.4.2.

Figure 3.4: Gates for universal computation: (a) for irreversible classical computation
(b) for reversible classical computation (c) for quantum computation.

There is a very simple and intriguing answer to the second question. To compare, we first
consider what operations are required for universal classical computation. If we allow our classical
computation to be logically irreversible (i.e. we are allowing information to be discarded) then
it suffices to have the operations COPY and NAND, which are illustrated in Figure 3.4. In
these diagrams, the bits being operated on "travel" from left to right through the gate, and the
effect of the gate is described by the outputs; for instance if the Boolean values x = 1 and y = 1
are input into a NAND gate, the output will be $\neg(x \wedge y) = \neg(1 \wedge 1) = 0$. Other
sets of gates are also universal. For classical reversible computation, the three-input Toffoli gate
is universal. This gate is also called a "controlled-controlled-NOT" (C$^2$-NOT), since the
"target" bit is negated if and only if the two "source" bits are both 1; otherwise the target bit is
untouched.
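To see why the Toffoli gate suffices for classical computation, the short Python sketch below
builds NAND and COPY out of it. It assumes that constant bits in both standard states 0 and 1 are
available as extra inputs; the function names are ours.

```python
def toffoli(a, b, c):
    """Controlled-controlled-NOT: flip the target c if and only if both source bits are 1."""
    return a, b, c ^ (a & b)

def nand(x, y):
    """NAND from a Toffoli gate whose target is prepared in the constant state 1."""
    return toffoli(x, y, 1)[2]

def copy(x):
    """COPY from a Toffoli gate: with one source fixed to 1, a target prepared as 0 becomes x."""
    _, _, out = toffoli(x, 1, 0)
    return x, out

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "NAND ->", nand(x, y))
print("COPY:", copy(0), copy(1))
```

Applying the gate twice returns every input to its original value, which is the sense in which the
gate is reversible: no information is discarded, it is merely carried along in the source bits.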

In quantum computing we are no longer dealing with abstract Boolean values, but with
two-level quantum systems whose basis states we denote $|0\rangle$ and $|1\rangle$. Note that
mathematically, any Hilbert space can be considered as a subspace of the joint space of a set of
qubits, so by considering operations on qubits we are not losing any generality. The most general
quantum evolution is unitary, so the question is: What type of qubit operations are required to
implement an arbitrary unitary operator on their state space? Deutsch [95] provided an answer: all
we need is a "controlled-controlled-R", where R is any qubit rotation by an irrational multiple of
$\pi$. A more convenient set [99] is that shown in Figure 3.4(c), the set comprising the C-NOT and
arbitrary single qubit rotations (such rotations are elements of U(2) and can be represented by
four real parameters). And interestingly, any single unitary operation acting on two or more qubits
is generically universal [100]. While this is a mathematically simple remark, physically this means
that any unitary operation in d dimensions can be carried out using a single fixed "local"
rotation, where by local we mean "operating in exactly four dimensions" (two qubits).

These gates are all universal iff a supply of bits in a standard state (0 or $|0\rangle$) is
available as well.

The reader may object that certain gate sets will be more practical than others; in particular,
attempting to compute using just one two-qubit gate looks like an uphill battle compared with,
say, using the gate set shown in Figure 3.4(c). However, the issue of complexity is implicit
in the definition of "universal": a gate set is only universal if it can simulate any other gate
set with polynomial slowdown; indeed, in the analysis of any proposed quantum algorithm, the
complexity of the required gates must be taken into account. Such issues are addressed in e.g.

The answer to the final question posed above is difficult, and is related to the open problem in
computing theory of classifying problems into complexity classes. Not all classically
exponential-time problems can be reduced to polynomial-time on a quantum computer (in fact there
appear to be very few of these), although many problems admit a square-root or polynomial
speed-up. General mathematical frameworks for the quantum factorisation problem are given by Mosca
and Ekert [101] and for the quantum searching problem by Mosca [102]. The relation of current
quantum algorithms (in the network model mentioned above) to the paradigm of multi-particle
interferometry is given by Cleve et al [103].



3.4 Quantum Channels 

The idea of a quantum channel was implicit in the discussion of quantum noiseless coding: 
the channel consisted of Alice sending Bob a "coded" physical system through an effectively 
noiseless channel. This is an idealisation; any means that Alice and Bob share for transmitting 
information is subject to imperfection and noise, so we should analyse how this noise can be 
characterised and whether transmission of intact quantum states can be achieved in the presence 
of noise. Ideally we would like to arrive at a "quantum noisy coding theorem" . 

Note that, just as in the classical case, error correction in quantum channels is not solely 
important for the obvious task of Alice sending information to Bob. Coding techniques are also 
used to protect information which is "transmitted" through time — as for example the data 
stored in computer memory, which is constantly refreshed and checked using parity codes. If 
we are to implement useful quantum computations, we will need similar coding techniques to 
prevent the decoherence (loss into the environment) of the quantum information. 

It is instructive to look first at how classical communication and computation are protected
against errors. In general the "imperfection and noise" can be well characterised by a stochastic
process acting on the information, whose effect is to flip signal i to signal j with probability
$p(j|i)$. We can remove the errors through mathematical or physical techniques: by using
redundancy or by engineering the system to be insensitive to noise. This insensitivity is a result
of combining amplification with dissipation [104]: the amplification provides a "restoring force"
on the system, while dissipation causes oscillations away from equilibrium to be damped. This is
the primary source of reliability in a modern classical computer. Unfortunately, it is precisely
these types of robustness which are not available in quantum systems; we cannot amplify the state
by copying it, and we cannot subject it to dissipative (non-unitary) evolutions.

What about redundancy? This plays a more important role in classical communication than
in computing. Error-correcting codes (a vast subject of research in itself [105]) were at
first thought to be inapplicable to quantum systems for the same reason that amplification is
not allowed. However, once it was realised that redundancy could be achieved by encoding the
state of a qubit into a much larger space (say that of five qubits) which is more robust against
decoherence, the techniques of classical coding theory could be used to discover which subspaces
should be used. A quantum theory of error correction has thus grown with techniques parallel
to the classical theory, as will be discussed in Section 3.4.2.
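The benefit of classical redundancy can be made concrete with the simplest possible code,
repetition with majority voting. The Python sketch below assumes that each copy of a bit is flipped
independently with probability p (a special case of the stochastic noise model above); the choice
of code and the function names are ours, for illustration only.

```python
from itertools import product

def majority_decode(bits):
    """Return the value held by more than half of the received copies."""
    return int(sum(bits) > len(bits) / 2)

def failure_probability(p, copies=3):
    """Probability that majority voting over `copies` independently flipped bits fails."""
    total = 0.0
    for flips in product((0, 1), repeat=copies):
        prob = 1.0
        for flipped in flips:
            prob *= p if flipped else 1 - p
        if majority_decode(flips) == 1:   # more than half of the copies were flipped
            total += prob
    return total

for p in (0.1, 0.01):
    print(p, failure_probability(p))
```

For three copies the failure probability is $3p^2(1-p) + p^3$, which is smaller than p whenever
p < 1/2. Note that the decoder reads and compares every copy, which is precisely the kind of
operation that has no direct counterpart for an unknown quantum state.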



3.4.1 Entanglement Purification 

One way to transmit quantum information is to teleport it. In a certain sense this is a circular
argument, because for teleportation Alice and Bob require shared entanglement, a resource
which requires transmission of coherent quantum information in the first place. Fortunately,
Bennett et al have had a word with Alice and Bob and told them how to go about getting almost
pure Bell states out of a noisy channel |^ | .



Before we discuss this, we need to know what operations are available to Alice and Bob. It
turns out that all we will need for this entanglement purification protocol (EPP) is three types of
operations: (i) unilateral rotations by Alice using the operators $\sigma_x$, $\sigma_y$,
$\sigma_z$; (ii) bilateral rotations, represented by $B_x$, $B_y$, $B_z$, which represent Alice and
Bob both performing the same rotation on their qubits; and (iii) a bilateral C-NOT, where Alice and
Bob both perform the operation C-NOT shown in Figure 3.4(c), where both the members of one pair are
used as source qubits and both qubits from another pair are used as target qubits. The effects of
these operations are summarised in the table in Figure 3.5 (complex phases are ignored in this
table). For example, if
Alice and Bob hold a pair AB of qubits in the state and either Alice or Bob (but not both)
performs the unitary operation $\sigma_y$ on their half of the pair, they have performed a
unilateral $\pi$ rotation. According to the table, pair AB will now be described by the state If
they had both performed the same $\sigma_y$ operation on their respective qubits (a bilateral $B_y$
operation) then we see from the table that the state of AB would remain

Note how the bilateral C-NOT is implemented: Alice prepares systems AB and CD in Bell
states, and sends B and D to Bob. Bob uses B as source and D as target in executing a C-NOT,
while Alice uses A as source and C as target (this is illustrated in Figure 3.6). In general, both
pairs AB and CD will end up altered, as shown in the table in Figure 3.5. For example, if the
source pair AB is prepared as $|\psi^+\rangle$ and the target pair CD as $|\phi^+\rangle$, then we
conclude from the table that after this operation the pair AB remains in the state
$|\psi^+\rangle$, but the pair CD is now described by $|\psi^+\rangle$ as well.

The most general possible way of describing the noise is as a superoperator $\$$ acting on the
transmitted states. We suppose that Alice manufactured a large number of pairs of systems
in the joint state $|\phi^+\rangle$, and when she sent half of them through the channel to Bob
they were corrupted by the joint superoperator $1 \otimes \$$ (Alice preserves her half perfectly).
We are only interested in the state $\rho_{AB}$ which emerges. This matrix, expressed in the Bell
basis, has three parts which behave differently under rotation: the behaves as a scalar, 3 terms
of the form ($|\nu\rangle$ is one of the Bell states) which behave as a vector, and the
$3 \times 3$ block which behaves as a second rank tensor. Bennett et al showed that through a
random operation of twirling any state $\rho_{AB}$ can be brought into the so-called Werner form

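$$W_F = F\,|\phi^+\rangle\langle\phi^+| + \frac{1-F}{3}\left(|\psi^+\rangle\langle\psi^+| + |\psi^-\rangle\langle\psi^-| + |\phi^-\rangle\langle\phi^-|\right). \tag{3.26}$$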

This "twirling" is done by performing a type of averaging: for each EPR pair she generates,
Alice chooses randomly one element from a set of specially chosen combinations of the operations
$B_x$, $B_y$, $B_z$, tells Bob which one it is, and they perform the specified operation. The
"twirls" are chosen so that the second rank tensor representing the states, when acted upon by
these random twirls, becomes proportional to the identity and the vector components disappear. The
idea is similar to motional averaging over the directional properties of a fluid, where all vector
quantities are zero and tensor properties can be described by a single parameter.

This is not exactly the Werner form; usually the singlet state $|\psi^-\rangle$ is the
distinguished state. The state in Eqn 3.26 differs from a standard Werner state by a simple
unilateral rotation.

Figure 3.5: The unilateral and bilateral operations used in entanglement purification
(after [47]).

Alice and Bob can now consider their ensemble of EPR pairs as a classical mixture of Bell
states, with proportion F of the state $|\phi^+\rangle$ and $(1-F)/3$ of each of the other states.
This is enormously useful because they can employ classical error correction techniques to project
out and find pure singlet states. There is a price paid, though: we have introduced additional
uncertainty, and this is reflected by the fact that $S(W_F) > S(\rho_{AB})$. It can be shown that
the protocol described below works without the twirl, but this is a more subtle argument. Note that
F is the fidelity between the Werner state and the state $|\phi^+\rangle$:
$F = \langle\phi^+|W_F|\phi^+\rangle$, where the fidelity of a transmitted state was defined in
Eqn 2.64.

After this pre-processing stage, Alice and Bob perform the following steps: 

1. They select the corresponding members of two pairs from their ensemble. 

2. They perform a bilateral C-NOT from the one pair to the other (illustrated in Figure 3.6).

3. They make (local) measurements on the target pair to determine its amplitude bit.

4. If the amplitude bit is 1 (i.e. if the target pair is in one of the $\psi$ states) then both
pairs are discarded. If the amplitude bit is 0, which occurs with probability
$P_{\text{pass}} = F^2 + \frac{2}{3}F(1-F) + \frac{5}{9}(1-F)^2$, the source pair is in the state
$W_{F'}$ with

$$F' = \frac{F^2 + \frac{1}{9}(1-F)^2}{F^2 + \frac{2}{3}F(1-F) + \frac{5}{9}(1-F)^2}. \tag{3.27}$$

Proof of this relation is left as an exercise for the reader, using the information from
Figure 3.5 and Eqn 3.26.

Figure 3.6: An illustration of the bilateral C-NOT operation, and its effect on the
fidelity of the Werner state (taken from [47]).



How exactly does this help us? Well, the answer is shown in Figure 3.6: for all starting fidelities
F > 1/2, we have that F' > F. So with probability of success $P_{\text{pass}} \ge 1/2$ we will have
succeeded in driving the Werner state closer to a pure Bell state.
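The behaviour of the recurrence is easy to explore numerically. The Python sketch below iterates
Eqn 3.27, assuming that the surviving pairs are returned to Werner form before each new round; the
target fidelity, the round limit and the function names are arbitrary choices made for the
illustration.

```python
def p_pass(f):
    """Probability that the target pair's amplitude bit is 0, so the source pair is kept."""
    return f**2 + (2 / 3) * f * (1 - f) + (5 / 9) * (1 - f) ** 2

def next_fidelity(f):
    """Fidelity F' of the surviving source pair after one round (Eqn 3.27)."""
    return (f**2 + (1 / 9) * (1 - f) ** 2) / p_pass(f)

def purify(f, target=0.99, max_rounds=100):
    """Iterate the recurrence until the target fidelity is reached; return (rounds, fidelity)."""
    rounds = 0
    while f < target and rounds < max_rounds:
        f = next_fidelity(f)
        rounds += 1
    return rounds, f

for start in (0.5, 0.6, 0.75, 0.9):
    print(start, "->", next_fidelity(start), "pass probability", p_pass(start))
print(purify(0.6))
```

The value F = 1/2 is a fixed point of the map, so pairs that start exactly there are not improved,
while any starting fidelity above 1/2 is driven towards 1.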

We may want to know how many EPR pairs m can be distilled from n copies of the initial
state $\rho_{AB}$ as n gets large; that is, we define the yield of a protocol P to be

$$D_P(\rho_{AB}) = \lim_{n \to \infty} m/n. \tag{3.28}$$

Unfortunately the yield of this protocol is zero. At each step we are throwing away at least half
of our pairs, and ever more steps are needed as the required fidelity approaches 1, so the fraction
of pairs that survives tends to zero. However, this protocol can be used in combination
with other protocols to give a positive yield. In particular, Bennett et al describe a variation
of a classical hashing protocol which gives good results.

An important feature to note about these protocols is that they explicitly require two-way 
communication between Alice and Bob, and so we may call them 2-EPP. What about 1-EPP — 
those protocols which use only communication in one direction? There is certainly an important 
distinction here. 1-EPP protocols can be used to purify entanglement through time, and thus to 
teleport information forward in time. This is exactly what one wishes to achieve using quantum 
error-correcting codes (QECC). On the other hand, a 2-EPP can be used for communication but 
not for error-correction. In fact, we can define two different rates of communication using EPP, 
for a particular channel described by superoperator $:

\[ D_1(\$) = \sup_{P \text{ is a 1-EPP}} D_P(\rho_{AB}) \tag{3.29} \]

\[ D_2(\$) = \sup_{P \text{ is a 2-EPP}} D_P(\rho_{AB}). \tag{3.30} \]

(Note that one EPR pair allows perfect transmission of one qubit, using teleportation; hence
the number of Bell states distilled can be interpreted as a rate of transmission of quantum
information.) Clearly, since all 1-EPP are also 2-EPP, we have $D_2 \geq D_1$. It can be shown that
for some mixed states $\rho_{AB}$ (or equivalently, for some channels $\$$) there is a strict separation
between $D_1$ and $D_2$; in fact there are some mixed states with $D_1 = 0$ and $D_2 > 0$ [47].

We can see now the importance of the entanglement swapping protocol presented earlier. For
most quantum communication channels, the fidelity is an exponentially decreasing function of
distance $l$. Usually this is described by a coherence length $l_0$. However if we divide the length
$l$ into $N$ intervals, the fidelity change over each interval is $\exp(-l/Nl_0)$, which can be made as
close to unity as required. Using EPP we can improve the fidelity over each interval as much
as required and then perform the entanglement swapping protocol approximately $\log_2 N$ times
to achieve high fidelity entanglement between the beginning and end points [107]. Using this
technique, the fidelity can be made into a polynomially decreasing function of distance, which
shows that it is possible in principle for us to implement long-distance quantum communication.
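
To get a feel for the numbers, the per-interval fidelity $\exp(-l/Nl_0)$ and the number of swapping rounds can be tabulated. This is only a sketch of the arithmetic in the paragraph above: the function names are hypothetical, lengths are in arbitrary units, and reading "approximately $\log_2 N$ times" as $\lceil \log_2 N \rceil$ rounds of pairwise joining is an assumption.

```python
import math


def segment_fidelity(length, coherence_length, segments):
    """Fidelity change exp(-l / (N l0)) over one of N equal intervals."""
    return math.exp(-length / (segments * coherence_length))


def segments_needed(length, coherence_length, target):
    """Smallest N for which each interval has fidelity >= target (0 < target < 1)."""
    return max(1, math.ceil(length / (coherence_length * -math.log(target))))


def swapping_levels(segments):
    """Number of pairwise-joining rounds needed to connect N intervals."""
    return math.ceil(math.log2(segments))


if __name__ == "__main__":
    l, l0 = 10.0, 1.0  # total length is ten coherence lengths
    n = segments_needed(l, l0, 0.9)
    print("direct transmission fidelity factor:", segment_fidelity(l, l0, 1))
    print("intervals needed for 0.9 per interval:", n)
    print("swapping rounds:", swapping_levels(n))
```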

Two other EPPs worth noting are the Procrustean method and the Schmidt projection method
[106], although this is more for their historical interest. They are both designed to operate on
non-maximally entangled pure states.



3.4.2 Quantum Error Correction 

At first sight, quantum errors don't seem amenable to correction — at least not in a similar 
sense to that of classical information. Firstly, quantum information cannot be copied to make 
it redundant. Secondly, quantum errors are analogue (as opposed to digital) errors: it can make 
an enormous difference if the state $a|0\rangle + b|1\rangle$ is received as $(a + \epsilon_1)|0\rangle + (b - \epsilon_2)|1\rangle$, whereas in
classical error correction we just need to prevent a system in the state '0' from entering the state 
'1' and vice versa. 

The first problem is countered by using logical states, which correspond to the |0) and |1) 
states of a two-level system, but which are in fact states of a much larger quantum system. In 
effect, we can use $n$ qubits with a $2^n$-dimensional Hilbert space, but only employ two states $|0_L\rangle$
and $|1_L\rangle$ in communication. We aim to design these states so that $a|0_L\rangle + b|1_L\rangle$ is recoverable
despite errors.

The second problem is solved by realising that errors can be digitised. Suppose we make a 
measurement on our redundant system, and we choose our POVM to be highly degenerate,
so that the only information it yields is which subspace the system is in. If we choose our 
redundancy carefully, and make a well-chosen measurement, we will project (and so discretise) 
the error and the measurement result will tell us which error has occurred in this projected 
system. This will be elaborated below, after we have discussed the type of errors we are dealing 
with. 
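
The digitisation of errors can be illustrated with the three-qubit repetition code, which is not the code discussed in this section but is the simplest case. The sketch below is written under several assumptions: the logical states are taken to be $|000\rangle$ and $|111\rangle$, the error is a small rotation $\cos\theta\, I - i \sin\theta\, X$ on the first qubit, and the degenerate measurement is a projective parity measurement (a special case of a POVM). All function names are hypothetical.

```python
import math


def encode(a, b):
    """Logical state a|000> + b|111> as a list of 8 amplitudes."""
    state = [0j] * 8
    state[0b000] = complex(a)
    state[0b111] = complex(b)
    return state


def apply_x(state, q):
    """Bit flip on qubit q (q = 0 is the leftmost bit of the basis label)."""
    mask = 1 << (2 - q)
    return [state[i ^ mask] for i in range(8)]


def rotate_x(state, q, theta):
    """Analogue error: apply cos(theta) I - i sin(theta) X to qubit q."""
    flipped = apply_x(state, q)
    return [math.cos(theta) * s - 1j * math.sin(theta) * f
            for s, f in zip(state, flipped)]


def syndrome(i):
    """Parities (b0 xor b1, b1 xor b2) of the basis label i."""
    b0, b1, b2 = (i >> 2) & 1, (i >> 1) & 1, i & 1
    return (b0 ^ b1, b1 ^ b2)


def measure_syndrome(state):
    """Return {syndrome: (probability, normalised post-measurement state)}."""
    outcomes = {}
    for s in [(0, 0), (1, 0), (1, 1), (0, 1)]:
        proj = [amp if syndrome(i) == s else 0j for i, amp in enumerate(state)]
        p = sum(abs(x) ** 2 for x in proj)
        if p > 1e-15:
            outcomes[s] = (p, [x / math.sqrt(p) for x in proj])
    return outcomes


# Which qubit to flip back for each syndrome (None means no error detected).
CORRECTION = {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}


def recover(state, s):
    q = CORRECTION[s]
    return state if q is None else apply_x(state, q)


def overlap(u, v):
    """Squared magnitude of the inner product <u|v>."""
    return abs(sum(x.conjugate() * y for x, y in zip(u, v))) ** 2
```

For any small angle the measurement returns either "no error" or "flip on the first qubit", and in both cases the recovered state has unit overlap with the encoded one: the continuous error has been projected onto one of a discrete set.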

As mentioned previously errors are introduced through interaction with the environment, 
which can be represented by the action of a superoperator on the state we are transmitting. 
Rather than using the operator-sum representation we will explicitly include the environment in 
our system — and we lose no generality in representing a superoperator if we assume that the 
environment starts out in a pure state $|e_0\rangle$. Then the state of the code system which starts as
$|\phi\rangle$ will evolve as

\[ |e_0\rangle|\phi\rangle \to \sum_k |e_k\rangle S_k|\phi\rangle \tag{3.31} \]

where the $S_k$ are unitary operators and the $|e_k\rangle$ are states of the environment which are not
necessarily orthogonal or normalised. The $S_k$ are operators on the code system Hilbert space
and form a complex vector space; so we choose some convenient basis for the space consisting of
operators $M_s$:

\[ S_k = \sum_s a_{ks} M_s \tag{3.32} \]

where $a_{ks}$ are complex constants. If we define $|e_s\rangle = \sum_k a_{ks}|e_k\rangle$, then the final state of the
system+environment in Eqn 3.31 will be $\sum_s |e_s\rangle M_s|\phi\rangle$. Once again the environment state vectors
are not orthogonal or normalised, but we can now choose the operators $M_s$ to have convenient
properties.

We note first of all that not all the errors can be corrected. This follows because not all
the states $M_s|\phi\rangle$ can be orthonormal, so we cannot unambiguously distinguish between them to
correct $M_s$. So we do the next best thing: we choose a certain subset $\mathcal{M}$ (called the correctable
errors) of the operators $M_s$ and seek to be able to correct just these errors.

What do we mean by "correcting errors"? Errors will be corrected by a recovery operator, a
unitary operator $\mathcal{R}$ which has the following effect:

\[ \mathcal{R}|\alpha\rangle M_s|\phi\rangle = |\alpha_s\rangle|\phi\rangle \tag{3.33} \]

where $|\alpha\rangle$ and $|\alpha_s\rangle$ are states of everything else (environment, ancilla, measuring apparatus) and
$M_s$ is any correctable error. In general the states $|\alpha_s\rangle$ are not going to be orthogonal, but the
important feature is that the states $|\alpha_s\rangle$ must not depend on the state $|\phi\rangle$ of the code system: if they
did depend on the code system state then error recovery would not work on linear combinations
of states on which it does work, so universal error correction would be impossible.

A quantum error-correcting code (QECC) is a subspace $V$ of the $2^n$-dimensional space
describing $n$ qubits together with a recovery operator $\mathcal{R}$. This QECC must satisfy the following
conditions: for every pair of code words $|u\rangle, |v\rangle \in V$ with $\langle u|v\rangle = 0$ and every pair of correctable
errors $M_s, M_t \in \mathcal{M}$,

\[ \langle u|M_s^\dagger M_t|v\rangle = 0 \tag{3.34} \]
\[ \langle u|M_s^\dagger M_t|u\rangle = \langle\alpha_s|\alpha_t\rangle \tag{3.35} \]



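with Eqn 3.35 holding independent of the code word $|u\rangle$. These conditions imply that errors can
be corrected, as can be seen in the following way. For any code words the recovery operator acts
as

\[ \mathcal{R}|\alpha\rangle M_s|u\rangle = |\alpha_s\rangle|u\rangle, \qquad \mathcal{R}|\alpha\rangle M_t|v\rangle = |\alpha_t\rangle|v\rangle. \tag{3.36} \]

Taking the inner product on both sides between these we find that

\[ \langle u|M_s^\dagger\langle\alpha|\mathcal{R}^\dagger\mathcal{R}|\alpha\rangle M_t|v\rangle = \langle\alpha_s|\alpha_t\rangle\langle u|v\rangle \tag{3.37} \]
\[ \langle u|M_s^\dagger\langle\alpha|\alpha\rangle M_t|v\rangle = \langle\alpha_s|\alpha_t\rangle\delta_{uv} \tag{3.38} \]

where $\mathcal{R}^\dagger\mathcal{R} = 1$ since $\mathcal{R}$ is unitary. The requirements listed above follow if $\langle u|v\rangle = 0$ or $u = v$
respectively.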
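
Conditions 3.34 and 3.35 can be checked by brute force for a small example. The sketch below assumes the three-qubit repetition code with code words $|000\rangle$ and $|111\rangle$ and the correctable set consisting of the identity and the three single bit flips; this example and the names `check_conditions`, `CODE` and `ERRORS` are illustrative choices, not taken from the text. Because only bit flips and real amplitudes are involved, $\langle u|M_s^\dagger M_t|v\rangle$ is computed as an ordinary dot product of $M_s|u\rangle$ with $M_t|v\rangle$.

```python
from itertools import product

DIM = 8  # three qubits


def basis(i):
    """Computational basis vector |i> as a list of amplitudes."""
    v = [0.0] * DIM
    v[i] = 1.0
    return v


def x_error(mask):
    """Operator flipping the qubits selected by the bits of mask."""
    return lambda v: [v[i ^ mask] for i in range(DIM)]


def inner(u, v):
    """Inner product for real amplitude vectors."""
    return sum(a * b for a, b in zip(u, v))


CODE = {"0_L": basis(0b000), "1_L": basis(0b111)}
ERRORS = {
    "I": x_error(0b000),
    "X1": x_error(0b100),
    "X2": x_error(0b010),
    "X3": x_error(0b001),
}


def check_conditions(code, errors):
    """Return (ok, table) where table[(s, t)] is the common value of <u|Ms^T Mt|u>."""
    table = {}
    for s, t in product(errors, repeat=2):
        diagonal = []
        for u, v in product(code, repeat=2):
            value = inner(errors[s](code[u]), errors[t](code[v]))
            if u != v and abs(value) > 1e-12:
                return False, table  # violates the orthogonality condition 3.34
            if u == v:
                diagonal.append(value)
        if max(diagonal) - min(diagonal) > 1e-12:
            return False, table  # value depends on the code word: violates 3.35
        table[(s, t)] = diagonal[0]
    return True, table


if __name__ == "__main__":
    ok, table = check_conditions(CODE, ERRORS)
    print("single bit flips correctable:", ok)
    bigger = dict(ERRORS, X1X2=x_error(0b110))
    print("with a double flip added:", check_conditions(CODE, bigger)[0])
```

Adding a double bit flip to the error set makes the check fail, since a double flip on one code word cannot be told apart from a single flip on the other.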

By reversing the above arguments, we find that Eqs 3.34 and 3.35 are also sufficient for the
existence of a QECC correcting errors $\mathcal{M}$. The aim in designing QECCs is thus to identify
the most important or likely errors acting on $n$ qubits, identify a subspace which satisfies the
requirements above, and find the recovery operator. This can be a complex task, but fortunately
many of the techniques of classical coding theory can be brought to bear. We will briefly sketch
the most common ideas behind classical error correction and indicate how these ideas may be
modified for QECC.

The simplest error model in classical coding theory is the binary symmetric channel. Binary 
refers to the fact that two signals are used, and symmetric indicates that errors affect both signals 
in the same way, by flipping $0 \leftrightarrow 1$; in addition, we assume that the errors are stochastic, i.e. affect
each signal independently and with probability $p$. In this case we can apply the algebraic coding
techniques pioneered by Hamming [105] to design parity check codes. The classical result is that,
by using a $k$ bit ($2^k$-dimensional) subspace of words of length $n$ (with $k < n$), we can design a
code that corrects up to $t$ errors; the Hamming bound on linear codes is

\[ k \leq n - \log_2 \sum_{i=0}^{t} \binom{n}{i} \tag{3.39} \]

where the term inside the sum is the binomial coefficient. Essentially the sum in this expression
counts the number of different ways in which up to $t$ errors can occur, and its logarithm is a lower
bound on the number $n - k$ of redundant bits required for error correction.
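
The Hamming bound is easy to evaluate. The sketch below, with the hypothetical helper `hamming_bound_k`, returns the largest integer $k$ permitted by Eqn 3.39, read as $k \leq n - \log_2 \sum_{i \leq t} \binom{n}{i}$, for given $n$ and $t$; it works in integers to avoid rounding in the logarithm.

```python
from math import comb


def hamming_bound_k(n, t):
    """Largest integer k with 2**k * sum_{i<=t} C(n, i) <= 2**n."""
    patterns = sum(comb(n, i) for i in range(t + 1))
    # (patterns - 1).bit_length() equals ceil(log2(patterns)) for patterns >= 1
    return n - (patterns - 1).bit_length()


if __name__ == "__main__":
    for n, t in [(3, 1), (5, 1), (7, 1), (23, 3)]:
        print(f"n = {n:2d}, t = {t}: k <= {hamming_bound_k(n, t)}")
```

For $n = 3$, $t = 1$ the bound gives $k \leq 1$, which is met by the three-bit repetition code, and for $n = 7$, $t = 1$ it gives $k \leq 4$.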

Similar algebraic techniques can be applied to a quantum generalisation of the binary
symmetric channel. In this case we take as a basis for the errors the qubit operators $I = 1$, $X = \sigma_x$,
$Y = -i\sigma_y$ and $Z = \sigma_z$ (we have introduced a factor of $-i$ into $Y$ for convenience); explicitly,

\[ I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad X = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad Y = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \quad Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \tag{3.40} \]

These operators correspond to a bit flip $0 \leftrightarrow 1$ ($X$), a phase flip $0 \to 0$, $1 \to -1$ ($Z$) or both
($Y = XZ$): a much richer set of errors than in the classical case! We will use a system of $n$
qubits, so a typical error will be $M_s = I_1 X_2 Z_3 \ldots Y_{n-1} I_n$. In analogy with the classical case,
a sensible choice of correctable errors will be those with weight $\leq t$, where the weight is the
number of single qubit error operators which are not $I$ (although if we have more sophisticated
knowledge of the noise, such as correlations between qubits, we would hope to include these
as well). For example, a single error-correcting QECC can be constructed using 5 qubits, to be
compared with a classical single error-correcting code which uses 3 bits (it only corrects $X$ errors).
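
The operator identities and the counting behind the figure of 5 qubits can be sketched as follows. The counting argument used here, usually called the quantum Hamming bound, is not stated in the text and strictly applies only to non-degenerate codes, so it should be read as an assumption; the helper names are hypothetical.

```python
from math import comb

# Single-qubit error basis, with Y = -i sigma_y as in the text.
I = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1], [1, 0]]
Z = [[1, 0], [0, -1]]


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def weight(error):
    """Weight of an n-qubit error written as a string such as 'IXZIY'."""
    return sum(1 for op in error if op != "I")


def num_errors(n, t):
    """Number of distinct n-qubit errors of weight at most t."""
    return sum(comb(n, i) * 3**i for i in range(t + 1))


def min_qubits(k, t):
    """Smallest n with 2**k * num_errors(n, t) <= 2**n (counting bound only)."""
    n = k
    while 2**k * num_errors(n, t) > 2**n:
        n += 1
    return n


if __name__ == "__main__":
    print("Y == XZ:", matmul(X, Z) == Y)
    print("weight of IXZIY:", weight("IXZIY"))
    print("errors of weight <= 1 on 5 qubits:", num_errors(5, 1))
    print("minimum n for k = 1, t = 1:", min_qubits(1, 1))
```

The factor $3^i$ reflects the three non-trivial operators available on each affected qubit, which is why the quantum count grows so much faster than the classical one.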

An interesting feature to note about a t-error correcting QECC is that the information is 
encoded highly nonlocally. For example, suppose we have encoded a single qubit into 5 qubits 
using the optimal 1-error correcting code. If we handed any one qubit over to an eavesdropper 
Eve, then the fidelity of our quantum system after recovery can still be arbitrarily high — and so 
by the no-cloning theorem Eve will not have been able to derive any information from her qubit. 
Contrast this with the classical case, in which, if Eve knew that the probability of error on any
single bit was $p \ll 1$, she could be confident that the signal was what she measured. In fact,
if in the course of recovery we make a measurement (as will usually be the case; the operator
$\mathcal{R}$ above includes this possibility) on the qubits, the result cannot reveal any information about
the logical state encoded in the set of qubits, for otherwise we will have disturbed the quantum 
information. These are important features of quantum error correction. 

A pedagogical survey of QECC and coding techniques can be found in ||2^, and a good set 
of references is given in the appropriate section of [104].



Quantum channel capacity So where does all of this leave us in terms of channel capacity? 
Our aim would be to characterise the number of qubits which can be recovered after the action 
of a superoperator $ with arbitrarily good fidelity. 

The quantum version of the problem of channel capacity is mathematically a great deal more
complicated than the classical version, so we would expect an answer to this question to be 
more difficult to reach. But we are faced with even more complication: the capacity depends 
on what other resources are available. For example, there are channels which cannot be used 
to communicate quantum states if only one-way communication is allowed, but have positive 
capacity if two-way communication is allowed |^^. On the other hand, t-error correcting QECCs 
exist with an asymptotic rate $k/n = 1 - 2H(2t/n)$ [108], where $H$ is the entropy function;
so if the errors are uncorrelated and have low enough average probability on each single qubit, 
we can transmit quantum information with arbitrarily good fidelity. The Shannon strategy of 
random coding is found to fail when transplanted into quantum information terms. Altogether, 
there are very few answers to the general characterisation of quantum channel capacity. 
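
The asymptotic rate quoted above can be evaluated directly. The sketch assumes that $H$ is the binary entropy function measured in bits, $H(x) = -x \log_2 x - (1-x)\log_2(1-x)$; the function names are hypothetical.

```python
from math import log2


def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def asymptotic_rate(error_fraction):
    """k/n = 1 - 2 H(2 t/n), where error_fraction = t/n."""
    return 1 - 2 * binary_entropy(2 * error_fraction)


if __name__ == "__main__":
    for frac in [0.0, 0.01, 0.02, 0.05, 0.06]:
        print(f"t/n = {frac:.2f}: k/n = {asymptotic_rate(frac):+.4f}")
```

Under this assumption the rate is positive only while $H(2t/n) < 1/2$, that is for $t/n$ below roughly 0.055, which makes concrete the requirement that the error probability per qubit be low enough.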



Fault tolerant quantum computing Another, possibly fatal, criticism can be found for 
quantum error correction ideas. A QECC uses extra qubits to encode one qubit of information, 
and requires us to "smear" the information over these other systems. This will require interaction 
between the qubits, and such interaction will in general be noisy; and extra qubits mean that 
each qubit has the potential to interact with the environment — noise which can spread through 
interactions between qubits. So the burning question is: Can we, by introducing more qubits 
and more imperfect interactions, extend a computation? Can we decrease the logical error rate? 

This is a serious problem, which initially seemed like a death blow to implementing quantum 
computing even in principle. However, careful analysis has shown that coherent quantum states 
can be maintained indefinitely, despite imperfect gates and noisy qubits, as long as the error 
rate is below a certain threshold. Such analyses are the subject of the enormous field of fault
tolerant quantum computing. Further discussion of this topic can be found in [109].



3.5 Measures of Entanglement 

With some idea of what entanglement can be used for and how to interchange forms of 
entanglement, we can return to the problem of quantifying an amount of entanglement. 

We can immediately define two types of entanglement for any bipartite quantum state $\rho_{AB}$:
the entanglement of formation $E_F$, which is the minimum number of singlet states required
to prepare $\rho_{AB}$ using only local operations and classical communication (LOCC); and the
entanglement of distillation $E_D$, which we define to be the maximum number of EPR pairs which
can be distilled from $\rho_{AB}$, with the maximum taken over all EPPs.

Bennett et al showed the very useful result that for pure states $|\psi_{AB}\rangle$ of the joint system,
the following result holds:

\[ E_F = E(|\psi_{AB}\rangle) = E_D \tag{3.41} \]



where E{-) is the entanglement measure proposed earlier in Eqn 3.1£ . The first equality follows 
from the fact that Alice can locally create copies of $|\psi_{AB}\rangle$, compress them using Schumacher
coding, and teleport the resulting states to Bob who can uncompress them; the second equality
can be shown to hold by demonstrating an EPP which achieves this yield. This is the sense,
alluded to previously, in which a single system in the state $|\psi_{AB}\rangle$ can be thought of as equivalent
to $E$ singlet states. This relation immediately allows us to simplify the definition of $E_F$ for mixed
states. If we denote by $\mathcal{E}_\rho = \{|\psi^i_{AB}\rangle, p_i\}$ an ensemble of pure states whose density operator is $\rho$,
then the entanglement of formation of $\rho$ is

\[ E_F(\rho) = \min_{\mathcal{E}_\rho} \sum_i p_i E(|\psi^i_{AB}\rangle). \tag{3.42} \]

This is justified by the fact that if we have $E_F(\rho)$ singlets, then Alice can teleport a probabilistic
mixture of pure states to Bob to end up with the density operator $\rho$. Wootters [110] has derived
an exact expression for the entanglement of formation of any state of two qubits.
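
For two qubits the entanglement of formation can be written in terms of a quantity called the concurrence $C$, as $E_F = H\big(\tfrac{1}{2}(1 + \sqrt{1 - C^2})\big)$ with $H$ the binary entropy. The sketch below does not implement the general mixed-state formula; it covers two special cases whose concurrence has a simple closed form, quoted here from the standard literature rather than from the text: a pure state $a|00\rangle + b|01\rangle + c|10\rangle + d|11\rangle$, for which $C = 2|ad - bc|$, and a Werner state of singlet fidelity $F$, for which $C = \max(0, 2F - 1)$. It also assumes that the measure $E$ for pure states is the entropy of the reduced density matrix. Function names are hypothetical.

```python
from math import log2, sqrt


def binary_entropy(x):
    """H(x) = -x log2 x - (1 - x) log2 (1 - x), with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def entanglement_from_concurrence(C):
    """Entanglement of formation (in singlets) of a two-qubit state of concurrence C."""
    C = min(max(C, 0.0), 1.0)  # guard against rounding just outside [0, 1]
    return binary_entropy((1 + sqrt(1 - C * C)) / 2)


def pure_state_concurrence(a, b, c, d):
    """Concurrence of the normalised state a|00> + b|01> + c|10> + d|11>."""
    return 2 * abs(a * d - b * c)


def werner_concurrence(F):
    """Concurrence of a Werner state with singlet fidelity F."""
    return max(0.0, 2 * F - 1)


if __name__ == "__main__":
    r = 2 ** -0.5
    print("singlet:", entanglement_from_concurrence(pure_state_concurrence(0, r, -r, 0)))
    print("product:", entanglement_from_concurrence(pure_state_concurrence(1, 0, 0, 0)))
    for F in [0.5, 0.75, 0.9, 1.0]:
        print(f"Werner F = {F}:", entanglement_from_concurrence(werner_concurrence(F)))
```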

Notice that in general $E_F \geq E_D$, since if this were otherwise, we could form an entangled
state from $E_F$ EPR pairs and distill a greater number of EPR pairs from that! Evidence suggests
[47] that the inequality is strict for all mixed states over the joint system.



Recall the definition in Eqn 3.2C of a separable bipartite state. We might ask whether there 
is any way to determine whether a state is separable without laboriously writing it in this form; and
we would be stuck for an answer.

Firstly, we might suppose that such states would always exhibit nonlocal correlations such 
as in Bell's inequalities. This attempt fails because there are mixed states (in fact, the Werner 
states introduced earlier) which do not violate any Bell's inequality because they admit hidden
variables descriptions [111]. Indeed, there are separable states which have nonlocal properties;
Bennett et al demonstrated a set of bipartite states of two 3-state systems which can be 
prepared locally but which cannot be unambiguously discriminated by local measurements. 

Secondly, we could ask whether all non-separable states can be used for something like
teleportation, since the Werner states can indeed be used for this [112]. We would once again
encounter a blind alley, since there are states which are nonlocal but cannot be used in quantum
teleportation [113]. This entanglement is termed "bound", in analogy with bound energy in
thermodynamics which cannot be used to do useful work: this entanglement cannot perform
any useful informational work.

Note that if a state can be used for teleportation, it can be purified to singlet form (by 
Alice teleporting singlets to Bob). Obviously, if a state can be distilled then it can be used for 
teleportation; so there is an equivalence here. 

In the spirit of trying to figure out what entanglement means, DiVincenzo et al suggested
another measure of entanglement, the entanglement of assistance [60]. It is defined very similarly
to entanglement of formation:

\[ E_A(\rho) = \max_{\mathcal{E}_\rho} \sum_i p_i E(|\psi^i_{AB}\rangle). \tag{3.43} \]

Obviously this function will have properties dual to those of $E_F$. It has some expected properties
(such as non-increase under LOCC) but some unexpected features too. $E_A$ also has an
interpretation in terms of how much pure state bipartite entanglement can be derived from a
tripartite state. And Vedral et al [114] have found a whole class of entanglement measures based
on quantifying the distance of a state from separability.

There is still much room for new ideas in quantifying entanglement, and in finding uses for
it (qualifying entanglement). In a certain sense entanglement is a "super" classical correlation:
it holds the potential for classical correlation, but that potential is only realised when a mea- 
surement is made. On the other hand it is also a very simple concept: it represents the fact that 
systems do not exist in classically separated states. Entanglement is merely a manifestation of 
this unity. 



CHAPTER 4 
Conclusion 



David Mermin [115] once identified three post-quantum theory generations of physicists according
to their views on the quantum. The first, who were around when the theory developed (which
includes the "Founding Fathers"), were forced to grapple with the elements of the theory to give
it a useful interpretation. Once a workable theory was developed, however, this generation came 
to regard the quirks of quantum theory as due to some deeply ingrained "classical" modes of 
thought and hence to be expected. The second generation — their students — took this lesson 
further by insisting that there was nothing unusual about quantum mechanics, and tried to make 
it mundane; and indeed, this "shut-up-and-calculate" approach did yield fruitful developments. 
The third generation, of which Mermin claims membership, doesn't seem to have much of an 
opinion one way or the other, regarding the theory as productive and empirical and carrying on. 
But, Mermin points out, when foundational questions are raised, the reaction of this generation 
varies from irritated to bored to plain uncomfortable. 

To this we can now add another generation: those with enough familiarity with the more
bizarre parts of the theory not to be blasé about it, but who take a practical approach by asking:
What's so special about this theory? How can we characterise it? If a classical ontology for this
theory doesn't work (if billiard balls and elastic collisions don't describe the microscopic world)
then what can be wrought from quantum theory in their place?

One famous illustration of this attitude — albeit from a member of the third generation — 
is the EPR paradox and its resolution by Bell. Bohr's own response to the EPR paper had been 
guarded and had appealed to his own orthodox interpretation; in a way the EPR dilemma was 
likened to counting angels on the head of a pin. The strongest argument against EPR was that 
mutually contradictory experimental setups should not be compared. But with a more pragmatic
approach, Bell succeeded in eschewing metaphysical arguments in favour of contemplation of the
theory and reasoning of the first class. This is a valuable lesson which has not gone unlearned
in the field of information physics. Since quantum physics is an inherently statistical theory, we
cannot properly engage with it without considering what those statistics mean in the real world,
both as input and as output from the theory. How do we operationally define a quantum
state? How much error do we introduce in our preparation, and how much is intrinsic according
to the theory? These are some of the questions which information physics has addressed.

Some of the current concerns of quantum information physics have been described in this 
thesis. The compression of mixed states is a minefield, but the curious examples cited in []7^ 
illustrate that much work is still to be done in the mapping of this minefield. The quantification 
of entanglement, classification of qualitatively different types of entanglement, their intercon- 
vertibility and their purification are still major focuses of work in quantum information theory. 

The implications of information physics reach further than questions of compression and 
capacity. An obvious application of this study is to high-precision measurement and to feedback 
control of quantum systems. The third generation of gravitational wave detectors for the LIGO 
experiment is expected to achieve a sensitivity beyond the standard quantum limit for position 
monitoring of a free mass [12], and novel techniques for measurement will have to be found. For
example, we could consider the Hamiltonian $H(x)$ of the mass to be controlled by a parameter
$x$, and it is our task to determine $x$ from measurements on the mass. In this case we can
consider the resulting state of the mass (after evolution for a time $t$) to be one of a set of states
$\rho_x$. How much information about $x$ can be extracted from this ensemble? And what is the
optimal measurement? These are questions addressed by theorems presented here as well as in 
ongoing research into techniques for optimal extraction of information. An illustration of the 
insight afforded by quantum information theory is the following: suppose two electrons are in
pure states with their spin polarisation vectors both pointing along the same unknown axis but
possibly in different directions. Then more information about this unknown direction can be
extracted from the pair of electrons if they are anti-parallel rather than parallel [116].

Indeed, ideas from quantum computing can also be of assistance in this. Grover's search 
algorithm was proposed to find some input $x$ to a "black box" binary-valued function $f$ such that
the output is 1: $f(x) = 1$. Essentially what is achieved through this algorithm is the extraction of
global information about the black box Hamiltonian through our addition of controlled terms to 
this Hamiltonian jl^ . Hence through quantum algorithm analysis we can find an optimal driving 
to apply to our measurement system and the measurement to apply in order to characterise the 
unknown Hamiltonian affecting our LIGO III detector.

Characterisation of entanglement and information in quantum systems is also important in
studying quantum chaos [117]. In these cases it becomes important to specify our information
about a chaotic system and how it behaves under small perturbations. This leads us to introduce
the concept of algorithmic entropy [118], which is a spin-off of classical computing theory and
accurately captures the idea of "complexity" of a system.

Much current work in information theory is aimed at narrowing the gap between the theory 
described in this thesis and available experimental techniques. Quantum computing is a major 
focus of these efforts, with a great deal of ingenuity being thrown at the problem of preserving 
fragile quantum states. Quantum error correction and fault-tolerant quantum computing are two 
of the products of such efforts, and further analysis of particular quantum gates and particular 
computations is proceeding apace. A recent suggestion to simplify a prospective "desk-top" quan- 
tum computer involves supplying ready-made generic quantum states to users who thus require 
slightly less sophisticated and costly hardware [69]; hence quantum computing may ultimately
also depend on faithful communication of quantum states. 

And of course, the million-dollar question in quantum computing is: What else can we do 
with a quantum computer? Factoring large integers is a nice parlour trick and will enable a user 
to crack most encryption protocols currently used on the internet. But is this enough of a reason 
to try and build one — particularly considering that in the near future quantum key distribution 
may become the cryptographic protocol of choice? Also, while Grover's search algorithm gives 
a handsome speed-up over classical search techniques (and some computer scientists are already 
designing quantum data structures), it will be a long time before Telkom starts offering us 
quantum phone directories. 

There is something for the romantics in quantum information theory too — at least those 
romantics with a taste for foundational physics. The following quote is rather a
dramatic denunciation of quantum computer theorists: 

It will never be possible to construct a 'quantum computer' that can factor a large 
number faster, and within a smaller region of space, than a classical machine would 
do, if the latter could be built out of parts at least as large and as slow as the 
Planckian dimension. 

This statement comes from Nobel laureate Gerard 't Hooft. Are we missing something from 
quantum mechanics? Will quantum mechanics break down, perhaps at the Planck scale? An 
interesting observation of fault-tolerant quantum computing is that the overall behaviour of a
well-designed circuit can be unitary for macroscopic periods of time (perhaps using concatenated
coding [109]) despite almost instantaneous decoherence and non-unitary behaviour at a lower
level. Is nature fault-tolerant? Rather than constructing a solar system-sized particle accelerator, 
we would be well-advised to assemble a quantum computer and put the theory through its paces 
in the laboratory. 



On the other hand, we may be able to deduce, à la Wootters [39], some principles governing
quantum mechanics: information-theoretic principles, which constrain the outcomes of our
proddings of the world. Weinberg, after failed attempts to formulate testable alternatives to
quantum mechanics, suggested [12]:



This theoretical failure to find a plausible alternative to quantum mechanics suggests 
to me that quantum mechanics is the way it is because any small changes in quantum 
mechanics would lead to absurdities. 



It's been a century since the discovery of the quantum: what lies in store in the next 100 years? 